(57) Abstract: The method, system and programming
language of the present invention provide for
program constructs, such as commands, declarations, 
variables, and statements, which have been developed 
to describe computations for an adaptive computing 
architecture, rather than provide instructions to a 
sequential microprocessor or DSP architecture. The 
invention includes program constructs that permit a 
programmer to define data flow graphs in software, to 
provide for operations to be executed in parallel, and 
to reference variable states and historical values in a 
straightforward manner. The preferred method, system, 
and programming language also includes mechanisms 
for efficiently referencing array variables, and enables 
the programmer to succinctly describe the direct data 
flow among matrices, nodes, and other configurations of 
computational elements and computational units forming 
the adaptive computing architecture. The preferred 
programming language includes dataflow statements, 
channel objects, stream variables, state variables, unroll 
statements, iterators, and loop statements. 
METHOD, SYSTEM AND LANGUAGE STRUCTURE FOR
PROGRAMMING RECONFIGURABLE HARDWARE 



Field of the Invention 

The present invention relates, in general, to software and code languages
used in programming hardware circuits, and more specifically, to a method, system, and 
language command or statement structure for defining adaptive computational units in 
reconfigurable integrated circuitry. 

Cross-Reference to Related Applications

This application is related to Paul L. Master et al., U. S. Patent Application 
Serial No. 09/815,122, entitled "Adaptive Integrated Circuitry With Heterogeneous And 
Reconfigurable Matrices Of Diverse And Adaptive Computational Units Having Fixed, 
Application Specific Computational Elements", filed March 22, 2001, commonly
assigned to Quicksilver Technology, Inc., and incorporated by reference herein, with
priority claimed for all commonly disclosed subject matter (the "first related 
application"). 

This application is related to Paul L. Master et al., U. S. Patent Application 
Serial No. 09/997,530, entitled "Apparatus, System and Method For Configuration Of 
Adaptive Integrated Circuitry Having Fixed, Application Specific Computational
Elements", filed November 30, 2001, commonly assigned to Quicksilver Technology,
Inc., and incorporated by reference herein, with priority claimed for all commonly 
disclosed subject matter (the "second related application"). 



Background of the Invention

The first related application discloses a new form or type of integrated 
circuitry which effectively and efficiently combines and maximizes the various 
advantages of processors, application specific integrated circuits ("ASICs"), and field 
programmable gate arrays ("FPGAs"), while minimizing potential disadvantages. The
first related application illustrates a new form or type of integrated circuit ("IC"), referred
to as an adaptive computing engine ("ACE"), which provides the programming flexibility
of a processor, the post-fabrication flexibility of FPGAs, and the high speed and high
utilization factors of an ASIC. This ACE integrated circuitry is readily reconfigurable, is
capable of having corresponding, multiple modes of operation, and further minimizes 
power consumption while increasing performance, with particular suitability for low 
power applications, such as for use in hand-held and other battery-powered devices.

Configuration information (or, equivalently, adaptation information) is
required to generate, in advance or in real-time (or potentially at a slower rate), the
adaptations (configurations and reconfigurations) which provide and create one or more 
operating modes for the ACE circuit, such as wireless communication, radio reception, 
personal digital assistance ("PDA"), MP3 music playing, or any other desired functions.

The second related application discloses a preferred system embodiment
that includes an ACE integrated circuit coupled with one or more sets of configuration
information. This configuration (adaptation) information is required to generate, in 
advance or in real-time (or potentially at a slower rate), the configurations and 
reconfigurations which provide and create one or more operating modes for the ACE 
circuit, such as wireless communication, radio reception, personal digital assistance
("PDA"), MP3 or MP4 music playing, or any other desired functions. Various methods,
apparatuses and systems are also illustrated in the second related application for 
generating and providing configuration information for an ACE integrated circuit, for 
determining ACE reconfiguration capacity or capability, for providing secure and 
authorized configurations, and for providing appropriate monitoring of configuration and
content usage. 

As disclosed in the first and second related applications, the adaptive 
computing engine ("ACE") circuit of the present invention, for adaptive or reconfigurable 
computing, includes a plurality of differing, heterogeneous computational elements
coupled to an interconnection network (rather than the same, homogeneous repeating and
arrayed units of FPGAs). The plurality of heterogeneous computational elements include 
corresponding computational elements having fixed and differing architectures, such as 
fixed architectures for different functions such as memory, addition, multiplication, 
complex multiplication, subtraction, synchronization, queuing, sampling, configuration,
reconfiguration, control, input, output, routing, and field programmability. In response to
configuration information, the interconnection network is operative, in advance, in real-time
or potentially slower, to configure and reconfigure the plurality of heterogeneous
computational elements for a plurality of different functional modes, including linear
algorithmic operations, non-linear algorithmic operations, finite state machine operations, 
memory operations, and bit-level manipulations. In turn, this configuration and 
reconfiguration of heterogeneous computational elements, forming various computational 
units and adaptive matrices, generates the selected, higher-level operating mode of the
ACE integrated circuit, for the performance of a wide variety of tasks. 

This adaptability or reconfigurability (with adaptation and configuration 
used interchangeably and equivalently herein) of the ACE circuitry is based upon, among 
other things, determining the optimal type, number, and sequence of computational 
elements required to perform a given task. As indicated above, such adaptation or
configuration, as used herein, refers to changing or modifying ACE functionality, from
one functional mode to another, in general, for performing a task within a specific 
operating mode, or for changing operating modes. 

The algorithm of the task, preferably, is expressed through "data flow 
graphs" ("DFGs"), which schematically depict inputs, outputs and the computational
elements needed for a given operation. Software engineers frequently use data flow 
graphs to guide the programming of the algorithms, particularly for digital signal 
processing ("DSP") applications. Such DFGs typically have one of two forms, either of
which is applicable to the present invention: (1) representing the flow of data through a
system where data streams from one module (e.g., a filter) to another module; and (2)
representing a computation as a combinational flow of data through a set of operators 
from inputs to outputs. 
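
By way of illustration only, the second form of data flow graph can be sketched in ordinary Python. The dictionary layout, the `evaluate` helper and the node names (`p0`, `p1`, `y`) are assumptions made for this sketch; they are not taken from the related applications and are not Q language syntax.

```python
import operator

# Hypothetical DFG encoding: each node name maps to (operator, input names).
# Names that have no entry in the graph are treated as external inputs.
def evaluate(graph, inputs, output):
    """Evaluate `output` by recursively pulling values from its inputs."""
    cache = dict(inputs)

    def value(name):
        if name not in cache:
            op, args = graph[name]
            cache[name] = op(*(value(a) for a in args))
        return cache[name]

    return value(output)

# y = (a * b) + (c * d): the two multiplications have no data dependency
# on each other, so suitable hardware could perform them in parallel.
mac_graph = {
    "p0": (operator.mul, ("a", "b")),
    "p1": (operator.mul, ("c", "d")),
    "y": (operator.add, ("p0", "p1")),
}

result = evaluate(mac_graph, {"a": 2, "b": 3, "c": 4, "d": 5}, "y")
```

The graph records only which operators feed which, leaving the order and placement of the operations open.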

A dilemma arises when developing programs for adaptive or 
reconfigurable computing applications, as currently there are not any adequate or 
sufficient methodologies or programming languages expressly designed for such adaptive
computing, other than the present invention. High-level programming languages, such as
C++ or Java, are widely used, well known, and easily maintainable. The languages were
developed to accommodate a variety of applications, many of which are platform-independent,
but all of which are fundamentally based upon compiling a sequence of
instructions ultimately fed into a processor, microprocessor, or DSP. The program code is
designed to run sequentially, generally in response to a user-initiated event. However, the
languages have limited capabilities of expressing the concurrency of computing
operations, and other features, which may be significant in adaptive computing
applications. 

Assembly languages, at the other extreme, tightly control data flow 
through hardware elements such as the logic gates, registers and random access memory 
(RAM) of a specific processor, and efficiently direct resource usage. By their very 
nature, however, assembly languages are extremely verbose and detailed, requiring the 
programmer to specify exactly when and where every operation is to be performed. 
Consequently, programming in an assembly language is extraordinarily labor-intensive, 
expensive, and difficult to learn. In addition, as languages designed specifically for 
programming a processor (i.e., fixed processor architecture), assembly languages have 
limited, if any, applicability to or utility for adaptive computing applications. 

In between these extremes, and also very different from a high-level
language, are hardware description languages (HDLs), which allow a designer to specify the
behavior of a hardware system as a collection of components described at the structural or 
behavioral level. These languages may allow explicit parallelism, but require the 
designer to manage such parallelism in great detail. In addition, like assembly languages, 
HDLs require the programmer to specify exactly when and where every operation is to be 
performed. 

As a consequence, a need remains for a method and system of providing 
programmability of adaptive computing architectures. A need also remains for a 
comparatively high-level language that is syntactically similar to widely used and well 
known languages like C++, for ready acceptance within the engineering and computing 
fields, but that also contains specialized constructs for an adaptive computing 
environment and for maximizing the performance of an ACE integrated circuit or other 
adaptive computing architecture. 

Summary of the Invention 

The present invention is a programming language, system and 
methodology that facilitate programming of integrated circuits having adaptive and
reconfigurable computing architectures. The method, system and programming language
of the present invention provide for program constructs, such as commands, declarations, 
variables, and statements, which have been developed to describe computations for an 
adaptive computing architecture, rather than provide instructions to a sequential
microprocessor or DSP architecture. The invention includes program constructs that 
permit a programmer to define data flow graphs in software, to provide for operations to 
be executed in parallel, and to reference variable states and historical values in a 
straightforward manner. The preferred method, system, and programming language also 
includes mechanisms for efficiently referencing array variables, and enables the 
programmer to succinctly describe the direct data flow among matrices, nodes, and other 
configurations of computational elements and computational units forming the adaptive 
computing architecture. The preferred programming language includes dataflow 
statements, channel objects, stream variables, state variables, unroll statements, iterators, 
and loop statements. 

Numerous other advantages and features of the present invention will 
become readily apparent from the following detailed description of the invention and the 
embodiments thereof, from the claims and from the accompanying drawings. 

Brief Description of the Drawings

Figure 1 is a block diagram illustrating a preferred apparatus embodiment
in accordance with the invention disclosed in the first related application.

Figure 2 is a block diagram illustrating a reconfigurable matrix, a plurality
of computation units, and a plurality of computational elements of the ACE architecture,
in accordance with the invention disclosed in the first related application.

Figure 3 is a block diagram depicting the role of Q language in 
programming instructions for configuring computational units, in accordance with the 
present invention. 

Figure 4 is a schematic diagram illustrating an exemplary data flow graph, 
utilized in accordance with the present invention. 

Figure 5 is a block diagram illustrating the communication between Q 
language programming blocks, in accordance with the present invention. 

Figures 6A, 6B and 6C are diagrams providing a useful summary of the Q
programming language of the present invention.

Figure 7 provides a FIR filter, expressed in the Q language for
implementation in adaptive computing architecture, in accordance with the present 
invention. 

Figure 8 provides a FIR filter with registered coefficients, expressed in the 
Q language for implementation in adaptive computing architecture, in accordance with
the present invention. 

Figures 9A and 9B provide a FIR filter for a comparatively large number
of coefficients, expressed in the Q language for implementation in adaptive computing 
architecture, in accordance with the present invention. 

Detailed Description of the Invention 

While the present invention is susceptible of embodiment in many 
different forms, there are shown in the drawings and will be described herein in detail 
specific embodiments thereof, with the understanding that the present disclosure is to be 
considered as an exemplification of the principles of the invention and is not intended to
limit the invention to the specific embodiments or generalized examples illustrated. 

As mentioned above, a need remains for a method and system of providing 
programmability of adaptive computing architectures. Such a method and system are 
provided, in accordance with the present invention, for enabling ready programmability 
of adaptive computing architectures, such as the ACE architecture. The present invention
also provides for a comparatively high-level language, referred to as the Q programming 
language (or Q language), that is designed to be backward compatible with and 
syntactically similar to widely used and well known languages like C++, for acceptance 
within the engineering and computing fields. More importantly, the method, system, and 
Q language of the present invention provide new and specialized program constructs for
an adaptive computing environment and for maximizing the performance of an ACE 
integrated circuit or other adaptive computing architecture. 

The Q language methodology of the present invention, including
commands, declarations, variables, and statements (which are individually and
collectively referred to herein as "constructs", "program constructs" or "program
structures"), has been developed to describe computations for an adaptive computing
architecture, and preferably the ACE architecture. It includes program constructs that
permit a programmer to define data flow graphs in software, to provide for operations to
be executed in parallel, and to reference variable states in a straightforward manner. The 
Q language also includes mechanisms for efficiently referencing array variables, and 
enables the programmer to succinctly describe the direct data flow among matrices, 
nodes, and other configurations of computational elements and computational units. Each 
of these new features of the Q language provides for effective programming in a
reconfigurable computing environment, facilitating a compiler to implement the 
programmed algorithms efficiently in adaptive hardware. While the Q language was 
developed as part of a design system for the ACE architecture, its feature set is not 
limited to that application, and has broad applicability for adaptive computing and other 
potential adaptive or reconfigurable architectures. 

As discussed in greater detail below, with reference to Figures 3 through 9, 
the program constructs of the language, method and system of the present invention 
include: (1) "dataflow" statements, which declare that the operations within the dataflow 
statement may be executed in parallel; (2) "channel" objects, which are objects with a 
buffer for data items, having an input stream and an output stream, and which connect 
together computational "blocks"; (3) "stream" variables, used to reference channel 
buffers, using an index which is automatically incremented whenever it is read or written, 
providing automatic array indexing; (4) "state" variables, which are register variables 
which provide convenient access to previous values of the variable; (5) "unroll" 
statements, which provide a mechanism for a loop-type statement to have a determinate 
number of iterations when compiled, for execution in the minimum number of cycles 
allowed by any data dependencies; (6) "iterators", which are special indexing variables 
which provide for automatic accessing of arrays in a predetermined address pattern; and
(7) "loop" statements, which provide for loop or repeating calculations which execute a 
fixed number of times. 
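
As a rough analogy only, the behavior described for "stream" variables (an index that advances on each access) and "state" variables (access to previous values) can be imitated in ordinary Python. The class names `Stream` and `State`, their methods, and the `fir` helper are hypothetical names chosen for this sketch; none of this is Q language syntax, and the semantics shown are a simplification of the description above.

```python
from collections import deque

class Stream:
    """Reads a buffer through an index that advances on every read."""
    def __init__(self, buffer):
        self._buffer = buffer
        self._index = 0

    def read(self):
        item = self._buffer[self._index]
        self._index += 1  # automatic indexing: no explicit subscript needed
        return item

class State:
    """Holds the current value plus a fixed number of previous values."""
    def __init__(self, depth, initial=0):
        self._history = deque([initial] * (depth + 1), maxlen=depth + 1)

    def update(self, value):
        self._history.appendleft(value)  # the oldest value falls off the end

    def previous(self, n=0):
        return self._history[n]  # n = 0 is current, n = 1 is one step back

def fir(samples, coefficients):
    """y[n] = sum over k of c[k] * x[n-k], with x[n-k] read from a State."""
    x = State(depth=len(coefficients) - 1)
    source = Stream(samples)
    outputs = []
    for _ in range(len(samples)):  # fixed trip count, as in a "loop" statement
        x.update(source.read())
        outputs.append(sum(c * x.previous(k) for k, c in enumerate(coefficients)))
    return outputs

filtered = fir([1, 2, 3, 4], [1, 1])
```

With coefficients `[1, 1]` each output is the sum of the current and previous sample, with the sample before the first taken as zero.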

These program constructs of the present invention have particular 
relevance for programming of the preferred adaptive computing architecture. When the 
program constructs are compiled and converted into configuration information and 
executed in the ACE, various computational units of the ACE architecture are configured 
or "called" into existence, executing the program across both space and time, such as for 
parallel execution of a dataflow statement. As a consequence, the ACE architecture is 
explained in detail below with reference to Figures 1 and 2, followed by the description of
the method, system and language of the present invention. 

Figure 1 is a block diagram illustrating a preferred apparatus 100
embodiment of the adaptive computing engine (ACE) architecture, in accordance with the
invention disclosed in the first related application. The ACE 100 is preferably embodied
as an integrated circuit, or as a portion of an integrated circuit having other, additional 
components. In the preferred embodiment, the ACE 100 includes one or more 
reconfigurable matrices (or nodes) 150, such as matrices 150A through 150N as 
illustrated, and a matrix interconnection network (MIN) 110. Also in the preferred
embodiment, one or more of the matrices 150, such as matrices 150A and 150B, are 
configured for functionality as a controller 120, while other matrices, such as matrices 
150C and 150D, are configured for functionality as a memory 140. While illustrated as 
separate matrices 150A through 150D, it should be noted that these control and memory 
functionalities may be, and preferably are, distributed across a plurality of matrices 150
having additional functions to, for example, avoid any processing or memory 
"bottlenecks" or other limitations. Such distributed functionality, for example, is 
illustrated in Figure 2. The various matrices 150 and matrix interconnection network 110 
may also be implemented together as fractal subunits, which may be scaled from a few 
nodes to thousands of nodes. 

A significant departure from the prior art, the ACE 100 does not utilize 
traditional (and typically separate) data, DMA, random access, configuration and 
instruction busses for signaling and other transmission between and among the 
reconfigurable matrices 150, the controller 120, and the memory 140, or for other 
input/output ("I/O") functionality. Rather, data, control and configuration information are 
transmitted between and among these matrix 150 elements, utilizing the matrix 
interconnection network 110, which may be configured and reconfigured, to provide any 
given connection between and among the reconfigurable matrices 150, including those 
matrices 150 configured as the controller 120 and the memory 140, as discussed in 
greater detail below. 

It should also be noted that once configured, the MIN 110 also
functions as a memory, directly providing the interconnections for particular functions, 
until and unless it is reconfigured. In addition, such configuration and reconfiguration 
may occur in advance of the use of a particular function or operation, and/or may occur in
real-time or at a slower rate, namely, in advance of, during or concurrently with the use of 
the particular function or operation. Such configuration and reconfiguration, moreover, 
may be occurring in a distributed fashion without disruption of function or operation, with 
computational elements in one location being configured while other computational 
elements (having been previously configured) are concurrently performing their 
designated function. This configuration flexibility of the ACE 100 contrasts starkly with
FPGA reconfiguration, which generally occurs comparatively slowly, not in real-time
or concurrently with use, and which must be completed in its entirety prior to any
operation or other use.

The matrices 150 configured to function as memory 140 may be
implemented in any desired or preferred way, utilizing computational elements (discussed
below) or fixed memory elements, and may be included within the ACE 100 or
incorporated within another IC or portion of an IC. In the preferred embodiment, the
memory 140 is included within the ACE 100, and preferably is comprised of
computational elements which are low power consumption random access memory
(RAM), but also may be comprised of computational elements of any other form of
memory, such as flash, DRAM, SRAM, MRAM, ROM, EPROM or E2PROM. In the
preferred embodiment, the memory 140 preferably includes direct memory access (DMA) 
engines, not separately illustrated. 

The controller 120 is preferably implemented, using matrices 150A and 
150B configured as adaptive finite state machines, as a reduced instruction set ("RISC") 
processor, controller or other device or IC capable of performing the two types of 
functionality discussed below. (Alternatively, these functions may be implemented 
utilizing a conventional RISC or other processor.) This control functionality may also be 
distributed throughout one or more matrices 150 which perform other, additional 
functions as well. In addition, this control functionality may be included within and 
directly embodied as configuration information, without separate hardware controller 
functionality. The first control functionality, referred to as "kernel" control, is illustrated
as kernel controller ("KARC") of matrix 150A, and the second control functionality,
referred to as "matrix" control, is illustrated as matrix controller ("MARC") of matrix
150B. The kernel and matrix control functions of the controller 120 are explained in
greater detail below, with reference to the configurability and reconfigurability of the
various matrices 150, and with reference to the preferred form of combined data, 
configuration and control information referred to herein as a "silverware" module. 

The matrix interconnection network 110 of Figure 1, and its subset
interconnection networks illustrated in Figure 2 (Boolean interconnection network 210,
data interconnection network 240, and interconnect 220), collectively and generally
referred to herein as "interconnect", "interconnection(s)" or "interconnection network(s)",
may be implemented generally as known in the art, such as utilizing FPGA
interconnection networks or switching fabrics, albeit in a considerably more varied
fashion. In the preferred embodiment, the various interconnection networks are
implemented as described, for example, in U.S. Patent No. 5,218,240, U.S. Patent No.
5,336,950, U.S. Patent No. 5,245,227, and U.S. Patent No. 5,144,166. These various
interconnection networks provide selectable (or switchable) connections between and
among the controller 120, the memory 140, the various matrices 150, and the
computational units 200 and computational elements 250, providing the physical basis for
the configuration and reconfiguration referred to herein, in response to and under the
control of configuration signaling generally referred to herein as "configuration
information". In addition, the various interconnection networks (110, 210, 240 and 220)
provide selectable or switchable data, input, output, control and configuration paths,
between and among the controller 120, the memory 140, the various matrices 150, and
the computational units 200 and computational elements 250, in lieu of any form of 
traditional or separate input/output busses, data busses, DMA, RAM, configuration and 
instruction busses. 

It should be pointed out, however, that while any given switching or 
selecting operation of or within the various interconnection networks (110, 210, 240 and
220) may be implemented as known in the art, the design and layout of the various
interconnection networks (110, 210, 240 and 220), in accordance with the ACE
architecture, are new and novel. For example, varying levels of interconnection are
provided to correspond to the varying levels of the matrices 150, the computational units
200, and the computational elements 250. At the matrix 150 level, in comparison with
the prior art FPGA interconnect, the matrix interconnection network 110 is considerably
more limited and less "rich", with lesser connection capability in a given area, to reduce
capacitance and increase speed of operation. Within a particular matrix 150 or
computational unit 200, however, the interconnection network (210, 220 and 240) may be
considerably more dense and rich, to provide greater adaptation and reconfiguration
capability within a narrow or close locality of reference.

The various matrices or nodes 150 are reconfigurable and heterogeneous,
namely, in general, and depending upon the desired configuration: reconfigurable matrix
150A is generally different from reconfigurable matrices 150B through 150N; 
reconfigurable matrix 150B is generally different from reconfigurable matrices 150A and 
150C through 150N; reconfigurable matrix 150C is generally different from 
reconfigurable matrices 150A, 150B and 150D through 150N, and so on. The various
reconfigurable matrices 150 each generally contain a different or varied mix of adaptive
and reconfigurable computational (or computation) units (200); the computational units
200, in turn, generally contain a different or varied mix of fixed, application specific
computational elements (250), which may be adaptively connected, configured and
reconfigured in various ways to perform varied functions, through the various
interconnection networks. In addition to varied internal configurations and
reconfigurations, the various matrices 150 may be connected, configured and
reconfigured at a higher level, with respect to each of the other matrices 150, through the
matrix interconnection network 110, also as discussed in greater detail in the first related
application.

Several different, insightful and novel concepts are incorporated within the 
ACE 100 architecture, provide a useful explanatory basis for the real-time operation of 
the ACE 100 and its inherent advantages, and provide a useful foundation for 
understanding the present invention. 

The first novel concepts of ACE 100 architecture concern the adaptive and
reconfigurable use of application specific, dedicated or fixed hardware units
(computational elements 250), and the selection of particular functions for acceleration, to
be included within these application specific, dedicated or fixed hardware units
(computational elements 250) within the computational units 200 (Fig. 4) of the matrices
150, such as pluralities of multipliers, complex multipliers, and adders, each of which is
designed for optimal execution of corresponding multiplication, complex multiplication, 
and addition functions. Through the varying levels of interconnect, corresponding 
algorithms are then implemented, at any given time, through the configuration and
reconfiguration of fixed computational elements (250), namely, implemented within
hardware which has been optimized and configured for efficiency, i.e., a "machine" is
configured in real-time which is optimized to perform the particular algorithm.

The next and perhaps most significant concept of the present invention is
the concept of reconfigurable "heterogeneity" utilized to implement the various selected
algorithms mentioned above. In accordance with the present invention, within 
computation units 200, different computational elements (250) are implemented directly 
as correspondingly different fixed (or dedicated) application specific hardware, such as 
dedicated multipliers, complex multipliers, and adders. Utilizing interconnect (210 and
220), these differing, heterogeneous computational elements (250) may then be 
adaptively configured, in advance, in real-time or at a slower rate, to perform the selected 
algorithm, such as the performance of discrete cosine transformations often utilized in 
mobile communications. As a consequence, in accordance with the present invention, 
different ("heterogeneous") computational elements (250) are configured and
reconfigured, at any given time, through various levels of interconnect, to optimally
perform a given algorithm or other function. In addition, for repetitive functions, a given 
instantiation or configuration of computational elements may also remain in place over 
time, i.e., unchanged, throughout the course of such repetitive calculations.

The temporal nature of the ACE 100 architecture should also be noted. At
any given instant of time, utilizing different levels of interconnect (110, 210, 240 and
220), a particular configuration may exist within the ACE 100 which has been optimized 
to perform a given function or implement a particular algorithm, such as to implement 
channel acquisition and control processing in a GSM operating mode in a mobile station. 
At another instant in time, the configuration may be changed, to interconnect other
computational elements (250) or connect the same computational elements 250 
differently, for the performance of another function or algorithm, such as for data and 
voice reception for a GSM operating mode. Two important features arise from this 
temporal reconfigurability. First, as algorithms may change over time to, for example, 
implement a new technology standard, the ACE 100 may co-evolve and be reconfigured
to implement the new algorithm. Second, because computational elements are 
interconnected at one instant in time, as an instantiation of a given algorithm, and then 
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reconfigured at another instant in time for performance of another, different algorithm, 
gate (or transistor) utilization is maximized, providing significantly better performance 
than the most efficient ASICs relative to their activity factors. This temporal 
reconfigurability also illustrates the memory functionality inherent in the MIN 110, as
mentioned above. 

This temporal reconfigurability of computational elements 250, for the 
performance of various different algorithms, also illustrates a conceptual distinction 
utilized herein between configuration and reconfiguration, on the one hand, and 
programming or reprogrammability, on the other hand. Typical programmability utilizes 
a pre-existing group or set of functions, which may be called in various orders, over time, 
to implement a particular algorithm. In contrast, configurability and reconfigurability, as 
used herein, includes the additional capability of adding or creating new functions which 
were previously unavailable or non-existent. 

Next, the present invention also utilizes a tight coupling (or interdigitation) 
of data and configuration (or other control) information, within one, effectively 
continuous stream of information. This coupling or commingling of data and
configuration information, referred to as "silverware" or as a "silverware" module, is the 
subject of another related patent application. For purposes of the present invention, 
however, it is sufficient to note that this coupling of data and configuration information 
into one information (or bit) stream, which may be continuous or divided into packets, 
helps to enable real-time reconfigurability of the ACE 100, without a need for the (often 
unused) multiple, overlaying networks of hardware interconnections of the prior art. For 
example, as an analogy, a particular, first configuration of computational elements at a 
particular, first period of time, as the hardware to execute a corresponding algorithm 
during or after that first period of time, may be viewed or conceptualized as a hardware 
analog of "calling" a subroutine in software which may perform the same algorithm. As a 
consequence, once the configuration of the computational elements has occurred (i.e., is
in place), as directed by (a first subset of) the configuration information, the data for use 
in the algorithm is immediately available as part of the silverware module. The same 
computational elements may then be reconfigured for a second period of time, as directed 
by second configuration information (i.e., a second subset of configuration information), 
for execution of a second, different algorithm, also utilizing immediately available data.
The immediacy of the data, for use in the configured computational elements, provides a
one or two clock cycle hardware analog to the multiple and separate software steps of
determining a memory address and fetching stored data from the addressed registers.
This has the further result of additional efficiency, as the configured computational
elements may execute, in comparatively few clock cycles, an algorithm which may
require orders of magnitude more clock cycles for execution if called as a subroutine in a
conventional microprocessor or digital signal processor ("DSP"). 

This use of silverware modules, as a commingling of data and 
configuration information, in conjunction with the reconfigurability of a plurality of
heterogeneous and fixed computational elements 250 to form adaptive, different and
heterogeneous computation units 200 and matrices 150, enables the ACE 100 architecture
to have multiple and different modes of operation. For example, when included within a 
hand-held device, given a corresponding silverware module, the ACE 100 may have 
various and different operating modes as a cellular or other mobile telephone, a music 
player, a pager, a personal digital assistant, and other new or existing functionalities. In
addition, these operating modes may change based upon the physical location of the 
device. For example, in accordance with the present invention, while configured for a 
first operating mode, using a first set of configuration information, as a CDMA mobile 
telephone for use in the United States, the ACE 100 may be reconfigured using a second 
set of configuration information for an operating mode as a GSM mobile telephone for
use in Europe. 

Referring again to Figure 1, the functions of the controller 120 (preferably 
matrix (KARC) 150A and matrix (MARC) 150B, configured as finite state machines) 
may be explained with reference to a silverware module, namely, the tight coupling of
data and configuration information within a single stream of information, with reference
to multiple potential modes of operation, with reference to the reconfigurable matrices
150, and with reference to the reconfigurable computation units 200 and the 
computational elements 250 illustrated in Figure 3. As indicated above, through a
silverware module, the ACE 100 may be configured or reconfigured to perform a new or
additional function, such as an upgrade to a new technology standard or the addition of an
entirely new function, such as the addition of a music function to a mobile
communication device. Such a silverware module may be stored in the matrices 150 of
memory 140, or may be input from an external (wired or wireless) source through, for
example, matrix interconnection network 110. In the preferred embodiment, one of the
plurality of matrices 150 is configured to decrypt such a module and verify its validity,
for security purposes. Next, prior to any configuration or reconfiguration of existing 
ACE 100 resources, the controller 120, through the matrix (KARC) 150A, checks and 
verifies that the configuration or reconfiguration may occur without adversely affecting 
any pre-existing functionality, such as whether the addition of music functionality would 
adversely affect pre-existing mobile communications functionality. In the preferred 
embodiment, the system requirements for such configuration or reconfiguration are 
included within the silverware module, for use by the matrix (KARC) 150A in 
performing this evaluative function. If the configuration or reconfiguration may occur 
without such adverse effects, the silverware module is allowed to load into the matrices
150 of memory 140, with the matrix (KARC) 150A setting up the DMA engines within 
the matrices 150C and 150D of the memory 140 (or other stand-alone DMA engines of a 
conventional memory). If the configuration or reconfiguration would or may have such 
adverse effects, the matrix (KARC) 150A does not allow the new module to be
incorporated within the ACE 100. 

Continuing to refer to Figure 1, the matrix (MARC) 150B manages the 
scheduling of matrix 150 resources and the timing of any corresponding data, to 
synchronize any configuration or reconfiguration of the various computational elements 
250 and computation units 200 with any corresponding input data and output data. In the 
preferred embodiment, timing information is also included within a silverware module, to 
allow the matrix (MARC) 150B through the various interconnection networks to direct a
reconfiguration of the various matrices 150 in time, and preferably just in time, for the 
reconfiguration to occur before corresponding data has appeared at any inputs of the 
various reconfigured computation units 200. In addition, the matrix (MARC) 150B may 
also perform any residual processing which has not been accelerated within any of the 
various matrices 150. As a consequence, the matrix (MARC) 150B may be viewed as a
control unit which "calls" the configurations and reconfigurations of the matrices 150, 
computation units 200 and computational elements 250, in real-time, in synchronization 
with any corresponding data to be utilized by these various reconfigurable hardware units, 
and which performs any residual or other control processing. Other matrices 150 may
also include this control functionality, with any given matrix 150 capable of calling and
controlling a configuration and reconfiguration of other matrices 150. 

Figure 2 is a block diagram illustrating, in greater detail, a reconfigurable 
matrix 150 with a plurality of computation units 200 (illustrated as computation units 
200A through 200N), and a plurality of computational elements 250 (illustrated as
computational elements 250A through 250Z), and provides additional illustration of the 
preferred types of computational elements 250. As illustrated in Figure 2, any matrix 150 
generally includes a matrix controller 230, a plurality of computation (or computational) 
units 200, and as logical or conceptual subsets or portions of the matrix interconnect 
network 110, a data interconnect network 240 and a Boolean interconnect network 210.
As mentioned above, in the preferred embodiment, at increasing "depths" within the ACE 
100 architecture, the interconnect networks become increasingly rich, for greater levels of 
adaptability and reconfiguration. The Boolean interconnect network 210, also as 
mentioned above, provides the reconfiguration and data interconnection capability 
between and among the various computation units 200, and is preferably small (i.e., only
a few bits wide), while the data interconnect network 240 provides the reconfiguration 
and data interconnection capability for data input and output between and among the 
various computation units 200, and is preferably comparatively large (i.e., many bits 
wide). It should be noted, however, that while conceptually divided into reconfiguration
and data capabilities, any given physical portion of the matrix interconnection network 
110, at any given time, may be operating as either the Boolean interconnect network 210, 
the data interconnect network 240, the lowest level interconnect 220 (between and among 
the various computational elements 250), or other input, output, configuration, or 
connection functionality. 

Continuing to refer to Figure 2, included within a computation unit 200 are 
a plurality of computational elements 250, illustrated as computational elements 250A 
through 250Z (individually and collectively referred to as computational elements 250), 
and additional interconnect 220. The interconnect 220 provides the reconfigurable 
interconnection capability and input/output paths between and among the various 
computational elements 250. As indicated above, each of the various computational 
elements 250 consists of dedicated, application specific hardware designed to perform a
given task or range of tasks, resulting in a plurality of different, fixed computational 
elements 250. Utilizing the interconnect 220, the fixed computational elements 250 may
be reconfigurably connected together into adaptive and varied computational units 200, 
which also may be further reconfigured and interconnected, to execute an algorithm or 
other function, at any given time, utilizing the interconnect 220, the Boolean network
210, and the matrix interconnection network 110. While illustrated with effectively two 
levels of interconnect (for configuring computational elements 250 into computational 
units 200, and in turn, into matrices 150), for ease of explanation, it should be understood 
that the interconnect, and corresponding configuration, may extend to many additional 
levels within the ACE 100. For example, utilizing a tree concept, with the fixed 
computational elements analogous to leaves, a plurality of levels of interconnection and 
adaptation are available, analogous to twigs, branches, boughs, limbs, trunks, and so on, 
without limitation. 

In the preferred ACE 100 embodiment, the various computational 
elements 250 are designed and grouped together, into the various adaptive and 
reconfigurable computation units 200. In addition to computational elements 250 which 
are designed to execute a particular algorithm or function, such as multiplication, 
correlation, clocking, synchronization, queuing, sampling, or addition, other types of 
computational elements 250 are also utilized in the preferred embodiment. As illustrated 
in Fig. 2, computational elements 250A and 250B implement memory, to provide local 
memory elements for any given calculation or processing function (compared to the more 
"remote" memory 140). In addition, computational elements 250I, 250J, 250K and 250L
are configured to implement finite state machines, to provide local processing capability 
(compared to the more "remote" matrix (MARC) 150B), especially suitable for 
complicated control processing. 

With the various types of different computational elements 250 which may 
be available, depending upon the desired functionality of the ACE 100, the computation 
units 200 may be loosely categorized. A first category of computation units 200 includes 
computational elements 250 performing linear operations, such as multiplication, 
addition, finite impulse response filtering, clocking, synchronization, and so on. A 
second category of computation units 200 includes computational elements 250 
performing non-linear operations, such as discrete cosine transformation, trigonometric 
calculations, and complex multiplications. A third type of computation unit 200 
implements a finite state machine, such as computation unit 200C as illustrated in Figure
2, particularly useful for complicated control sequences, dynamic scheduling, and 
input/output management, while a fourth type may implement memory and memory 
management, such as computation unit 200A as illustrated in Fig. 2. Lastly, a fifth type 
of computation unit 200 may be included to perform bit-level manipulation, such as for 
encryption, decryption, channel coding, Viterbi decoding, and packet and protocol 
processing (such as Internet Protocol processing). In addition, another (sixth) type of 
computation unit 200 may be utilized to extend or continue any of these concepts, such as 
bit-level manipulation or finite state machine manipulations, to increasingly lower levels 
within the ACE 100 architecture. 

In the preferred embodiment, in addition to control from other matrices or 
nodes 150, a matrix controller 230 may also be included or distributed within any given 
matrix 150, also to provide greater locality of reference and control of any reconfiguration 
processes and any corresponding data manipulations. For example, once a 
reconfiguration of computational elements 250 has occurred within any given 
computation unit 200, the matrix controller 230 may direct that that particular 
instantiation (or configuration) remain intact for a certain period of time to, for example, 
continue repetitive data processing for a given application. 

With this foundation of the preferred adaptive computing architecture 
(ACE), the need for the present invention is readily apparent, as there are no adequate or 
sufficient high-level programming languages which are available to fully exploit such 
adaptive hardware. The Q language of the present invention, for example, provides 
program constructs in a high-level language that allow detailed description of concurrent 
computation, without requiring the complexity of a hardware description language. One 
of the goals of the Q language is to incorporate language features which allow a compiler 
to make efficient use of the adaptive hardware to create concurrent computations at the
operator level and the task level. Figure 3 illustrates the role of the Q language in the 
context of the ACE architecture, and beginning with the exemplary data flow graph of 
Figure 4, the new and novel features of the present invention are discussed in detail. 

It should be noted that in the following discussion, and with regard to the 
present invention in general, the important features are the mechanisms and the semantics 
of the mechanisms, such as for the dataflow statements, channels, stream variables, state
variables, unroll statements, and iterators, rather than the particular syntax involved. 

Figure 3 is a block diagram depicting the role of Q language in providing 
for configuration of computational units, in accordance with the present invention. Figure 
3 depicts the progress of an algorithm (function or operation) 300, coded in the high-level
Q language 305, through a plurality of system design tools 310, such as a scheduler and Q 
compiler 320, to its final inclusion as part of an adaptive computing IC (ACE) 
configuration bit file 335, which contains the configuration information for adaptation of 
an adaptive computing circuit, such as the ACE 100. The system design tools 310, which 
include a hardware object "creator", a computing operations "scheduler" and an operation
"emulator" are the subject of other patent applications. Relevant to the present invention 
are the scheduler and Q compiler 320 component. Components of an adaptive computing 
circuit are initially defined as hardware "objects", and in this instance, specifically as 
adaptive computing objects 325. Once the algorithm, function or operation (300) has 
been expressed in the Q language (305), the scheduler portion of scheduler and Q
compiler 320 arranges (or schedules) the programmed operations with or across the 
adaptive computing objects 325, in a sequence across time and across space, in an 
iterative manner, producing one or more versions of adaptive computing architectures 
330, and eventually selecting an adaptive computing architecture as optimal, in light of 
various design goals, such as speed of operation and comparatively low power
consumption. 

When the programmed operations have been scheduled across the selected 
adaptive computing architecture, the Q compiler portion of scheduler and Q compiler 320 
then converts the scheduled Q program into a bit-level information stream (configuration
information) 335. (It should be noted that, as used throughout the remainder of this
discussion, any reference to a "compiler" should be understood to mean this Q compiler
portion of scheduler and Q compiler 320, or an equivalent compiler). Following
conversion of the selected adaptive computing architecture into a hardware description
340 (using any preferred hardware description language such as Verilog or VHDL) and
fabrication 345, the resulting adaptive computing integrated circuit 335 may be
configured, using the configuration information 335 generated for that adaptive 
computing architecture. 
For example, one of the novel features of the Q language is that it can
specify parallel execution of particular functions or operations, rather than being limited 
to sequential execution. Using defined adaptive computing objects 325, such as ACE 
computational elements, the scheduler selects computational elements and matches the 
desired parallel functions to available computational elements, or creates the availability 
of computational elements, for the function to be executed at a scheduled time, in parallel, 
across these elements. 

Figure 4 is a schematic diagram illustrating an exemplary data flow graph, 
utilized in accordance with the present invention. Algorithms or other functions selected 
for acceleration are converted into data flow graphs (DFGs), which describe the flow of 
inputs through computational elements to produce outputs. The data flow graph of Figure 
4 shows various inputs passing through multipliers and then iterating through adders to
produce outputs. Equipped with data flow graphs, the high-level Q code may be refined 
to improve the computing performance of the algorithm. 

As illustrated, the data flow graph describes a comparatively fine-grained 
computation, i.e., a computation composed of relatively simple, primitive operators like
add and multiply. As discussed below, data flow graphs may also be used at a higher
level of abstraction to describe more coarse-grained computations, such as those
composed of complex operators like filters. These operators typically correspond to tasks 
that may comprise many instances of the more fine-grained data flow graphs. 
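
As a rough illustration of the fine-grained data flow graph described above (inputs passing through multipliers and then accumulating through adders), the following Python sketch evaluates such a graph. It is not Q code; the node names, the dictionary representation and the evaluate function are assumptions introduced only for illustration.

```python
import operator

# Hypothetical representation: each node maps to (operator, operand names).
# Names that are not nodes of the graph are treated as external inputs.
GRAPH = {
    "m0": (operator.mul, ("x0", "c0")),
    "m1": (operator.mul, ("x1", "c1")),
    "m2": (operator.mul, ("x2", "c2")),
    "a0": (operator.add, ("m0", "m1")),
    "a1": (operator.add, ("a0", "m2")),
}

def evaluate(graph, inputs, output):
    """Evaluate one output of an acyclic data flow graph on demand."""
    values = dict(inputs)

    def value_of(name):
        # Compute a node once, after its operands have been computed.
        if name not in values:
            op, operands = graph[name]
            values[name] = op(*(value_of(o) for o in operands))
        return values[name]

    return value_of(output)

inputs = {"x0": 1, "x1": 2, "x2": 3, "c0": 4, "c1": 5, "c2": 6}
result = evaluate(GRAPH, inputs, "a1")  # 1*4 + 2*5 + 3*6
```

In this sketch the three multiplier nodes have no dependencies on one another, so they are the kind of operations that could be mapped to separate computational elements and executed concurrently, while the adders form a dependent chain.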

For example, a digital signal processing ("DSP") system involves a 
plurality of operations that can be depicted by data flow graphs. Q supports the 
construction of DSP systems by utilizing computational "blocks" consisting of a plurality 
of programmed DFGs that communicate with each other via data "streams". Data are 
passed from one block to another by connecting the output streams of blocks to the input 
streams of other blocks. A DSP system operates efficiently by running the individual 
blocks when input data are available, which then produces output data used by other 
blocks. Blocks may be executed concurrently, as determined by a Q scheduler. (It should 
be noted that this Q scheduler is different than the system tool scheduler (of 320) 
discussed above, which schedules the compiled Q code to available computational 
elements, in space and time). 
At its simplest, a block implements a computation that consumes some
number of inputs and processes them to produce some number of outputs. A block in the
Q language is an object, that is, an instance of a class. It can be loaded into a matrix, and it
has persistent data, such as stream variables and coefficients, state, and methods such as
init() and run(). As exemplary methods, invoking the init() method initializes
connections and performs any other system specific initialization, while the run()
method, which has no parameters, executes the block.

As an example, a finite impulse response filter ("FIR"), commonly used in 
digital signal processing, could be implemented as a Q block. The filter coefficients, the 
input and output streams and a variable used for the input state are part of the filter state. 
The run() method processes some number of inputs from an input stream, computes, and
writes the outputs to an output stream. The run() method could be called many times for
successive streams of input data, with the state of the execution saved between 
invocations. 
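
The persistence of block state between invocations can be sketched in Python as follows. This is not Q code: the class name FirBlock is hypothetical, and for brevity the sketch passes samples to run() as an argument and returns the outputs, whereas the run() method described above takes no parameters and uses the block's input and output streams.

```python
class FirBlock:
    """Hypothetical block: coefficients and input history persist between runs."""

    def init(self, coefficients):
        self.coefficients = list(coefficients)
        # Input state: the most recent samples, newest first, initially zero.
        self.history = [0] * len(self.coefficients)

    def run(self, samples):
        """Process the available samples and return the corresponding outputs."""
        outputs = []
        for sample in samples:
            # Shift the new sample into the saved input state.
            self.history = [sample] + self.history[:-1]
            outputs.append(
                sum(c * h for c, h in zip(self.coefficients, self.history))
            )
        return outputs

fir = FirBlock()
fir.init([1, 2, 3])
first = fir.run([1, 0])   # first burst of input data
second = fir.run([0, 0])  # later burst; the saved state carries over
```

Feeding a unit impulse across the two calls yields the coefficients 1, 2, 3 in order, which shows that the second call continues from the state left by the first rather than starting over.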

Treating a matrix computation as an object allows it to be run in short 
bursts instead of all at once. Because its state is persistent, execution of a computation 
object can be stopped and continued at a later time. This is vital for real-time DSP 
applications where data become available incrementally. In the example FIR filter, the 
filter can be initialized, and run on input data as it becomes available without any 
overhead to reinitialize or load data into the matrix. This also allows many matrix 
computations to concurrently share the hardware because each maintains its own data. 

The efficiency of a block's execution as measured in power usage and 
clock cycles depends upon how well the compiler can optimize the programming code to 
produce a configuration bit file that directs parallel execution of operations while 
minimizing memory accesses. Q contains constructs that allow the programmer to 
expose the parallelism of the computation to the compiler in a block, and to compose a 
digital signal processing system as a collection of blocks, supporting both types of data 
flow mentioned above. 

The overall goal of the Q language is to support systems that are 
implemented partly in hardware using either the adaptive computing architecture or 
parameterized hardwired components, and which may also be implemented partly in 
software on a conventional processor. Q primarily supports the construction of DSP
systems via the composition of computational blocks that communicate via data streams. 
These blocks are compiled to run either on the host processor or in the adaptive 
computing architecture. This flexibility of implementation supports code reuse and 
flexible system implementation as well as rapid system prototyping using a software only
solution. When a block is compiled to the adaptive computing architecture, the compiler 
attempts to produce an efficient parallel version that minimizes memory accesses. How 
well the compiler can do this generally depends on how the block is written: as 
mentioned above, Q contains constructs that allow the programmer to expose the 
parallelism of the computation to the compiler.

The blocks of the present invention follow a reactive dataflow model, 
removing data from input streams and processing it to produce data on output streams. 
Data is passed from one block to another by connecting the output streams of blocks to 
the input streams of other blocks. The entire system operates by running the individual 
blocks when their input data are available, which then produces output data used by other
blocks. The scheduling of blocks can either be done statically at compile time in the case 
of well-behaved data flow systems such as synchronous data flow, or dynamically in the 
more general case. The scheduler can be supplied either by the system software, which 
uses information supplied by the blocks about their I/O characteristics, or it can be left to
the user program. In order for a system to be scheduled automatically, the blocks should
publish their I/O characteristics. 
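
For the statically schedulable case, a minimal sketch of how published I/O rates could be used is given below. It applies the standard balance condition of synchronous data flow (items produced per schedule period equal items consumed) to a single producer and consumer; this condition comes from general synchronous data flow practice rather than from the present description, and the function name is hypothetical.

```python
from math import gcd

def repetitions(produced_per_run, consumed_per_run):
    """Smallest run counts (producer, consumer) that leave the channel balanced."""
    g = gcd(produced_per_run, consumed_per_run)
    # producer_runs * produced_per_run == consumer_runs * consumed_per_run
    return consumed_per_run // g, produced_per_run // g

# A producer writing 2 items per run feeding a consumer reading 3 items per run.
producer_runs, consumer_runs = repetitions(2, 3)
```

With these rates, one period of a static schedule runs the producer three times and the consumer twice, after which the channel holds the same number of items as when the period began.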

A stream carrying data between two blocks is implemented as a channel, 
which contains a buffer to store data items in transit between the blocks as well as 
information about the size of the buffer and the number of items in it. Blocks producing
data use an output stream to send data through a channel to the input stream of another
block. When a block writes data to an output stream, the data is stored in the channel 
where it becomes available to the input stream. When a block reads data from an input 
stream, it is removed from the channel. Thus the channel implements the FIFO implicit 
in dataflow graph arcs. The channel buffer is typically implemented using shared buffers 
so that no data copying is necessary: the writing block writes data directly into the buffer
and the reading block reads it directly from the buffer. 
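
The FIFO behavior of a channel can be sketched in Python as follows. The class and method names are hypothetical (only items() echoes a name used elsewhere in this description), and treating a write to a full buffer or a read from an empty buffer as an error reflects the non-blocking case; a shared, copy-free buffer is not modeled here.

```python
from collections import deque

class Channel:
    """Hypothetical channel: a bounded FIFO buffer between a writer and a reader."""

    def __init__(self, capacity):
        self.capacity = capacity  # size of the buffer, in data items
        self.buffer = deque()

    def items(self):
        """Number of data items currently in transit."""
        return len(self.buffer)

    def write(self, item):
        # Non-blocking assumption: writing to a full channel is an error.
        if len(self.buffer) >= self.capacity:
            raise RuntimeError("channel full")
        self.buffer.append(item)

    def read(self):
        # Reading removes the oldest item from the channel.
        if not self.buffer:
            raise RuntimeError("channel empty")
        return self.buffer.popleft()

chan = Channel(16)
chan.write(10)
chan.write(20)
oldest = chan.read()
```

Items leave the channel in the order in which they were written, which is the FIFO ordering implicit in a dataflow graph arc.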
Streams are declared to carry a specific data type which may be a built-in
type or user-defined such as a class object or an array. Reads and writes are done on 
items of the data type and the channel buffer is sized in terms of how many data items it 
contains. A stream data item may be as simple as a number or as complex as an array of 
data. Reading an input stream normally consumes a data item and writing an output 
stream produces a data item to the stream. However, for complex data items where the 
item may be processed incrementally, an open can be done to get a handle to the next 
item of the stream without consuming or producing it. After the item has been processed, 
a close is used to complete the read or write. More complex operations may also be 
supported, such as reading ahead or behind the current location in the stream. However, 
such operations make assumptions about the streams that are difficult for a scheduler to 



In order for the scheduler to be able to construct a schedule, a block should 
publish its I/O characteristics and its computation timing. This information can be used 
by a scheduler at compile time to construct a static schedule, or at run time for dynamic 
scheduling. Such information can be used as preconditions that must be met before a 
block is executed. For example, the precondition might be that there are eight data items 
available on the input stream and space for eight data items on the output stream. 
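
A precondition of this kind can be sketched in Python as follows. The record type and function names are hypothetical; the sketch only checks the two conditions named in the example above, namely sufficient input items and sufficient free space on the output.

```python
from dataclasses import dataclass

@dataclass
class IoCharacteristics:
    """Hypothetical record of what a block publishes about one execution."""
    consumes: int  # items removed from the input stream
    produces: int  # items written to the output stream

def precondition_met(io, input_items, output_items, output_capacity):
    """True if the block can execute without under- or overflowing a channel."""
    free_space = output_capacity - output_items
    return input_items >= io.consumes and free_space >= io.produces

io = IoCharacteristics(consumes=8, produces=8)
# Eight items waiting, output buffer of 16 holding 8: the block may run.
ready = precondition_met(io, input_items=8, output_items=8, output_capacity=16)
# Same input, but only 6 free output slots: the block must wait.
blocked = precondition_met(io, input_items=8, output_items=10, output_capacity=16)
```

A dynamic scheduler could evaluate such a test at run time before each execution, while a static scheduler could use the same published numbers at compile time.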

Streams may be declared to be non-blocking (the default) or blocking. 
Non-blocking is the default for dataflow systems where scheduling is done to ensure that 
no blocking can occur. In this case reading an empty stream or writing a full stream is an 
error. Blocking only makes sense where blocks can run in parallel or where block 
execution can be suspended to allow other blocks to supply the needed data. Blocking is 
implemented in hardware for hardware blocks. Note that streaming I/O can be used to 
implement double-buffering, either blocking or non-blocking. In this case, the channel 
buffer contains space for two items (which can be arrays) where the output stream can be 
writing one array while the input stream reads the other. 

The stream buffer sizes depend on the relative rates at which blocks 
produce and consume data. Normally dataflow blocks are written in terms of the 
computation corresponding to one time step, sample or frame. For example, a filter 
would consume the next input sample, producing the corresponding output sample. 
Implementing a system at such a fine-grained level might be very inefficient, however. 
The programmer may decide for efficiency reasons that every invocation of a block will
compute many data samples; however, larger buffers are needed to store the increased 
amount of I/O data. 

An application will generally comprise both signal processing components
constructed as data flow graphs as described above, as well as control-oriented
"supervisor code" that interacts with other applications and the operating system, and
controls the overall processing required by the application. This control-oriented part of
the application would be written in the usual procedural style, as known in the art. This
supervisor code may execute the nodes of a dataflow graph directly, particularly when the
computation produces information that changes how the computation is performed.

The key concepts, mechanisms, constructs and syntax of the Q language 
are described in detail below. 



1. DATAFLOW STATEMENTS in the Q language 

Q computation objects describe computations that use the adaptive
computing architecture to apply operations to input data to produce output data. The set
of operations is depicted in data flow graphs and is accomplished in programming code
by a plurality of assignment statements. Although some operations may be executed in
parallel, the execution semantics are defined by the sequential ordering of assignments as
they appear in a program. A compiler may perform analysis to find parallelism, or may
not detect opportunities for parallelism that may be obvious to an experienced
programmer. As a consequence, in accordance with the present invention, the Q
"dataflow" statement informs the compiler that the code within braces following the
dataflow statement describes a computation corresponding to a static, acyclic data flow
graph that can be executed in parallel. Other than conditional branching performed using
the known method of predicated execution (which moves branches into a data flow
graph), there is no branching in the dataflow section, and no non-obvious side effects or
aliasing that would cause data dependencies a compiler cannot detect. If the data flow
graph is invoked as a loop body, the scheduler may schedule the data flow graphs of
adjacent iterations so that they overlap and thus achieve even greater parallelism. For a
comparatively straightforward example: 
int sumY1;
int sumY2;
int sumXY1;
int sumXY2;
dataflow {
    sumY2 = sumY2 + sumY1;
    sumXY2 = sumXY2 + sumXY1;
}

The example above shows four variables of data type (or datatype) integer, 
two of which are assigned new values within a dataflow section. Because the values of 
sumY2 and sumXY2 are independent, the dataflow statement directs that the two 
operations be done in parallel. (While useful for explanatory purposes, this example is 
relatively trivial, as a compiler may recognize such an easy example; in actual practice, 
the dataflow statement is especially useful for directing a compiler or scheduler in how to 
divide large data flow graphs into units which may be scheduled in parallel). 
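
By way of illustration only, the independence of the two assignments above can be sketched in C++ rather than in Q. The Assignment structure and the independent() function are hypothetical names introduced for this sketch and are not part of the Q language; the sketch merely assumes that two assignments may be scheduled in parallel when neither writes a variable that the other reads or writes.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of one assignment statement: the variable it writes
// and the variables it reads. Used only to illustrate independence.
struct Assignment {
    std::string writes;
    std::set<std::string> reads;
};

// Two assignments may run in parallel when neither writes a variable
// that the other one reads or writes.
bool independent(const Assignment& a, const Assignment& b) {
    return a.writes != b.writes &&
           b.reads.count(a.writes) == 0 &&
           a.reads.count(b.writes) == 0;
}

// The two statements from the dataflow example above.
inline Assignment stmtSumY2()  { return {"sumY2",  {"sumY2",  "sumY1"}}; }
inline Assignment stmtSumXY2() { return {"sumXY2", {"sumXY2", "sumXY1"}}; }
```

Under this model the two statements of the example share no variables, so the check succeeds; a statement that wrote sumY1 would conflict with the first statement, which reads it.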

2. CHANNELS and BLOCKS in the Q language 

Q blocks are connected together using Q "channels", each channel an 
object with a buffer in memory for data, an input stream and an output stream. Channels 
are conceptually related to "named pipes" in the Unix operating system environment, but 
unlike named pipes, when channel data are accessed they need not be copied from the 
buffer to another location. 

In the method of the present invention, a channel is allocated to a first
block to use as an output stream, then the channel is subsequently defined as an input
stream to a second block, to connect the two blocks. A channel is declared with the type
of data communicated through the channel and the size of the buffer. The following code
fragment illustrates how two blocks are connected using a channel:
// Channel with buffer for 16 items of datatype fraction
channel<fract16> chan(16);

// Connect blockA output to channel
blockA.init(streamOut<fract16>(chan));
// Connect blockB input to channel
blockB.init(streamIn<fract16>(chan));
// Are there more than 4 items in the buffer?
if (chan.items() > 4)
    blockB.run();

The channel also has a method that allows supervisor code to find out
the size of the buffer and how full it is.
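
For illustration only, a channel of this kind might be modeled in C++ as a fixed-size circular buffer. This is a minimal sketch and not the Q implementation: the Channel class, its size(), push() and pop() members, and the readyToRun() helper are assumptions introduced here; only items() corresponds to a name used in the fragment above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C++ stand-in for a Q channel: a fixed-size circular buffer
// that reports its capacity and its current fill level.
template <typename T>
class Channel {
public:
    explicit Channel(std::size_t capacity) : buf_(capacity) {
        assert(capacity > 0);
    }

    std::size_t size() const { return buf_.size(); }  // buffer capacity
    std::size_t items() const { return count_; }      // items waiting

    // Producer side: store one item after the newest one, wrapping around.
    void push(const T& v) {
        assert(count_ < buf_.size());
        buf_[(head_ + count_) % buf_.size()] = v;
        ++count_;
    }

    // Consumer side: remove and return the oldest item.
    T pop() {
        assert(count_ > 0);
        T v = buf_[head_];
        head_ = (head_ + 1) % buf_.size();
        --count_;
        return v;
    }

private:
    std::vector<T> buf_;
    std::size_t head_ = 0;   // index of the oldest item
    std::size_t count_ = 0;  // number of stored items
};

// Supervisor-style check mirroring "if (chan.items() > 4) blockB.run();".
template <typename T>
bool readyToRun(const Channel<T>& chan, std::size_t threshold) {
    return chan.items() > threshold;
}
```

In this sketch the supervisor decision reduces to a comparison of the fill level against a threshold, which is the role items() plays in the Q fragment.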



3. STREAM variables 

Blocks access channels via streams. A "stream" variable supports the
streaming I/O abstraction whereby each "read" of the input stream variable retrieves the
next available value of the stream and each "write" to an output stream sends a value to
the stream. A stream variable references a channel buffer and is implemented using an
index that is automatically incremented whenever it is read or written. This automatic
array indexing is accomplished by using an address generator in the adaptive computing
architecture or other hardware.

// Declare an input stream variable and an
// output stream variable with a buffer of N items of
// datatype fraction.
streamIn<fract16> svar(N);
streamOut<fract16> svar(N);

// Reference an input stream:
// returns current value, advances stream.
var = svar.read();

// Write to output stream:
// sends next value, advances stream.
svar.write(var);

// Open a stream data item for read/write without advancing
// stream
var = svar.open();

// Close an open stream data item: advances the stream
svar.close();

// Debug method: print the stream buffer,
// showing current location
svar.display();

The relationships between blocks, channels and streams are illustrated in
Figure 5. Block 400A uses a stream variable 401A to write to channel 402. Channel 402
stores the data until the scheduler determines that enough data have accumulated to
justify a read by block 400B, which uses a stream variable 401B as input.

As described above, channels have methods that allow supervisor code to
learn the size of the channel's buffer, and how full it is. The scheduler can then optimize
I/O operations of the streams from/to the various blocks. Furthermore, because channel
variables can be shared among blocks, multiple blocks can access channel data
simultaneously, increasing parallel execution. The stream variable and sample Q
programs are discussed in greater detail below.

A stream variable supports the streaming I/O abstraction whereby each
read of the input stream variable retrieves the next available value of the stream and
each write to an output stream sends a value to the stream. A stream variable references
a channel buffer and is implemented using an index that is automatically incremented
whenever it is read or written. This automatic array indexing is implemented directly
using an address generator. The following example program snippet computes a FIR
filter using stream and state variables. Each loop iteration reads a sample from the input
stream, computes the resulting output, and writes it to the output stream. The sample
state variable is used to keep a history of the values assigned to sample. Note that
sample[1] refers to the current value of the sample state variable because of the
assignment to sample before the unroll statement (discussed in greater detail below).

streamIn<fract16> input;     // Input stream of samples
streamOut<fract16> output;   // Output stream for results

loop (int l=0; l<nOut; l++) dataflow {
    // Read the next sample from input stream
    sample = input.read();
    sum = 0.0;
    unroll (int i=0; i<nCoef; i++) {
        sum = sum + coefReg[i] * sample[nCoef-i];
    }
    output.write(sum);       // Write result to output stream
}

A stream variable is usually initialized by the init() method to reference a
channel provided by the calling procedure. Note that channels are implemented using a
circular buffer, that is, the stream index wraps around to the beginning of the channel
buffer when it reaches the end.
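
The automatically incremented, wrapping index described above can be sketched in C++ as follows. This is an illustrative model only: the Stream class and its index() accessor are hypothetical, and a plain std::vector stands in for the channel buffer. In the adaptive computing architecture the indexing is performed by an address generator rather than by software.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stream over a shared buffer: every read() or write() uses
// the current index and then advances it, wrapping at the end of the buffer.
template <typename T>
class Stream {
public:
    explicit Stream(std::vector<T>& buffer) : buf_(&buffer) {
        assert(!buffer.empty());
    }

    // Return the current item, then advance the stream by one item.
    T read() {
        T v = (*buf_)[index_];
        advance(1);
        return v;
    }

    // Store a value at the current position, then advance by one item.
    void write(const T& v) {
        (*buf_)[index_] = v;
        advance(1);
    }

    std::size_t index() const { return index_; }

private:
    // Circular-buffer behavior: the index wraps back to the beginning.
    void advance(std::size_t n) { index_ = (index_ + n) % buf_->size(); }

    std::vector<T>* buf_;
    std::size_t index_ = 0;
};
```

Two Stream objects constructed over the same buffer play the roles of the output stream of one block and the input stream of another; this sketch does not model flow control between them.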

The read and write stream methods read and write individual data items
in streams. For more complicated stream processing, the open method can be used to get
a pointer to the next item in the stream. This pointer can then be used, for example, to
access data items that are complex data types or arrays. The close method is then used
to complete the open, which moves the stream index to the next data item in the stream.
The open and close methods can also be used with output streams. By default, the
stream is advanced by one data item by each read, write or close. In cases where the
stream data is treated as an array, the stream must be informed via the init() method
how many data items to advance. When open() is used to process blocks of data, it is
important that the channel buffer be sized in units of the block size, so that the block
of data processed by an open() does not go past the end of the buffer. Thus, if a stream
contains image data which is processed via an open() in blocks of 8 rows (as in the
example below) then the channel buffer must be sized in units of 8-row blocks.
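
The buffer sizing requirement can be made concrete with a short C++ sketch. The BlockStream class below is hypothetical and only loosely mirrors the init(), open() and close() methods described above; it assumes that each open()/close() pair covers a fixed number of items and that the buffer length is a whole multiple of that number, so that no block straddles the end of the circular buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block-oriented stream: open() exposes the current block in
// place, and close() advances the stream index by one whole block.
template <typename T>
class BlockStream {
public:
    explicit BlockStream(std::vector<T>& buffer) : buf_(&buffer) {}

    // Set how many items each open()/close() pair covers. The buffer must
    // be sized in units of the block size.
    void init(std::size_t itemsPerOpen) {
        assert(itemsPerOpen > 0 && buf_->size() % itemsPerOpen == 0);
        step_ = itemsPerOpen;
    }

    // Pointer to the current block; the stream is not advanced.
    T* open() { return buf_->data() + index_; }

    // Complete the open: advance by one block, wrapping at the buffer end.
    void close() { index_ = (index_ + step_) % buf_->size(); }

    std::size_t index() const { return index_; }

private:
    std::vector<T>* buf_;
    std::size_t index_ = 0;
    std::size_t step_ = 1;
};
```

Because the index only ever moves in whole blocks and the buffer holds a whole number of blocks, the pointer returned by open() always addresses a contiguous block inside the buffer.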

Sometimes data needs to be accessed in more complex ways than simple 
streams allow. The following complicated example uses a combination of streams and 
iterators (discussed below) to process an image. 

streamIn<fract14> inputStr;

// The inSwath array is one swath from the input stream
fract14 *inSwath;

// We will access the input swath using the 3D iterator below:
//   foreach (window in the row of windows)
//     foreach (row in the window)
//       foreach (pixel in the row)
Qiterator<fract14> inSwathI;     // 3D iterator

// Output access pattern is the same as for the input image
streamOut<fract14> outputStr;    // Output stream for result swaths

// The outSwath array is one swath written to the output stream
fract14 *outSwath;

// We will access the output swath using the 3D iterator below:
//   foreach (window in the row of windows)
//     foreach (column in the window)
//       foreach (pixel in the column)
Qiterator<fract14> outSwathI;    // 3D iterator

inputStr.init(8*imageWidth);     // init() initializes the stream
outputStr.init(8*imageWidth);



fract14 dataIn[8];
fract14 dataOut[8];

// Get next swath from input stream and initialize iterator
inSwath = inputStr.open();

// Treat the input swath as a 3D array [row, window, col]
inSwathI.init(inSwath,
    1, 0, 1, 8,              // rows in window
    2, 0, 1, imageWidth/8,   // windows on row
    0, 0, 1, 8);             // columns in window

// Get access to next swath in output stream
outSwath = outputStr.open();

// Treat the output swath as a 3D array [row, window, col]
outSwathI.init(outSwath,
    0, 0, 1, 8,              // rows in window
    2, 0, 1, imageWidth/8,   // windows on row
    1, 0, 1, 8);             // columns in window
// Loop over all windows in a row of the image
loop (int w=0; w<imageWidth/8; w++) {
    loop (int row=0; row<8; row++) dataflow {
        unroll (i=0; i<8; i++) {
            dataIn[i] = inSwathI.next();
        }
        // The row DCTs are done here ...
    }

    loop (int col=0; col<8; col++) dataflow {
        // The column DCTs are done here ...

        // Write the results to the output array
        unroll (i=0; i<8; i++) {
            outSwathI.next() = dataOut[i];
        }
    }
}

inputStr.close();    // We are done with the input and output
outputStr.close();

All that is shown here are the details of accessing the input and output 
images - the computation has been omitted for clarity. It should also be noted that the 
particular syntax used was designed for backward compatibility with C++ as a prototype 
implementation; a myriad of other syntaxes are available and may even be clearer, and 
are within the scope of the present invention. For example, the Q code: 

inSwathI.init(inSwath,
    1, 0, 1, 8,              // rows in window
    2, 0, 1, imageWidth/8,   // windows on row
    0, 0, 1, 8);             // columns in window

may be equivalently replaced with:

inSwathI = {|
    for( int i = 0; i < imageWidth/8; i++ )
        for( int j = 0; j < 8; j++ )
            for( int k = 0; k < 8; k++ )
                inSwath[j][i*8+k]
|}
The block processes all the 8x8 windows on an 8-row swath, producing a
corresponding swath in the output image. Pixels in the input image are accessed in row
major order within each 8x8 window, while pixels in the output image are written in
column major order. Clearly, the pixels cannot be accessed in stream order, so an open()
is used to access an entire swath. The stream init() method is used to indicate how
many pixels are read and written by each open()/close() pair for the input and output
images. The pointer returned by the open() is handed to the iterator, which also indicates
how the iteration is done. In this case, a 3-dimensional iterator is used to define the
windowed data structure on the image swath. Note that the iterator must be reinitialized
for each new swath. Also note that we write the program to process single windows
because the window data is not contiguous in the stream, while swaths are.

In some cases, processing may require the program to read ahead on a
stream, and then back up and read some of the data again. The rewind() method is
provided to allow a program to back up a stream. The argument to rewind indicates how
many data items to back up. If the argument is negative, the stream is moved forward.
Caution must be used with rewind because if blocks are running in parallel, then the
producing block may have already written into the buffer space vacated by the reads,
leaving no space for the rewind().

4. STATE variables

Q language "state" variables allow convenient access to previous values of
a variable in a computation occurring over time. For example, a FIR filter may refer to
the previous N values of the input variable. State variables avoid having to keep track of
the history of a variable explicitly, thus streamlining programming code. State variables
are declared as follows:

state<type> name(N);

where "type" is the data type, "name" is the name of the state variable, and "N" is a
constant which declares how far into the past a variable value can be referenced. Arrays
of state variables are allowed, for example:

state<fract16> X[8](2);




which declares an array of 8 state variables of data type fraction, each of which keeps two 
history values. 

The value of a state variable i time units in the past (i.e. time = t-i) is referenced
using the [] operator:

sum = sum + in[i];

refers to the value of in, i time steps in the past.

A state variable is assigned using a normal assignment statement to the
state variable without the time operator []. For example the assignment:

state<fract16> S(4);
S = X;

assigns a new value X to S. Each assignment to a state variable causes time to advance
for that state variable. Time is defined for a state variable by the assignments made to it.
When a state variable is assigned a value, time advances and the value becomes the
previous value of the variable, i.e. S[1]. After the statement S = X; above, the value of
S[1] is X, the previous value of S[1] becomes available as S[2], the previous value of
S[2] is available as S[3], etc. State variables can be initialized by specifying their values
for specific times in the past. This is done by assigning a value to X[i] to initialize the
value of X at t-i. Assignments to a state variable using the [] notation do not advance
time.
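
The time-advancing behavior of a state variable can be sketched in C++ as shown below. The State class is a hypothetical model written for illustration, not the Q implementation; it assumes that valid history indices run from 1 to N and that history entries not yet assigned hold a default value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical C++ model of a Q state variable that keeps N past values.
template <typename T>
class State {
public:
    explicit State(std::size_t depth) : hist_(depth, T{}) {
        assert(depth > 0);
    }

    // Plain assignment advances time: every stored value moves one step
    // further into the past and the new value becomes S[1].
    State& operator=(const T& v) {
        for (std::size_t k = hist_.size() - 1; k > 0; --k) {
            hist_[k] = hist_[k - 1];
        }
        hist_[0] = v;
        return *this;
    }

    // S[i] is the value i time steps in the past. Writing through this
    // reference initializes history without advancing time.
    T& operator[](std::size_t i) {
        assert(i >= 1 && i <= hist_.size());
        return hist_[i - 1];
    }

private:
    std::vector<T> hist_;  // hist_[0] holds S[1], hist_[1] holds S[2], ...
};
```

With S declared as State<int> S(4), the statements S[1] = 10; S[2] = 20; initialize the history, and a following S = 30; leaves S[1] equal to 30, S[2] equal to 10 and S[3] equal to 20.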

5. UNROLL statement

"Unroll" statements in the Q language, in general, are utilized to provide 
for parallel execution of computations and other functions, which may otherwise be 
problematic due to the sequential nature of typical "loop" statements of the prior art. 

More specifically, the "unroll" statement provides for control over how a compiler
handles a loop: on the one hand, it can be used to direct the compiler (320) to unroll the
code before scheduling it; on the other hand, where a compiler might aggressively unroll
a loop, the unroll statement of the invention may constrain precisely how it should be
unrolled. "Unroll" statements in the Q language utilize the syntax and semantics utilized
in C for loops, but are compiled very differently, with very different results. An unroll in
the Q language is converted at compile time into straight-line code, each command of
which implicitly could be executed in parallel. Unroll parameters must be known at
compile time and any reference to the iteration variable in the unroll body evaluates to a
constant.



For example, the code fragment below assigns the value of the index of an
array to the indexed element of the array:

int16 j;
int16 a[4];
unroll (j=0; j<4; j++) {
    a[j] = j;
}

is equivalent to the code

int16 a[4];
a[0] = 0;
a[1] = 1;
a[2] = 2;
a[3] = 3;

Unroll statements are allowed in dataflow blocks, because the entire unroll
statement can in principle be executed in a single cycle if the data dependencies allow it.
It should be noted that loop and unroll are quite different; although both run a fixed
number of iterations, loops are executed a number of iterations determined at run time,
while unroll statements are elaborated into a dataflow graph at compile time. This means
that loops cannot be part of a dataflow block because it is not known until runtime how
many iterations a loop will execute (i.e., the different iterations of a loop statement must
be executed sequentially, in contrast to the parallel execution of an unroll statement).
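
The distinction between the two constructs has a rough analogue in C++, sketched below for illustration only. The Unroll template, fillUnrolled4() and fillLoop() are hypothetical names; template recursion is used here as a stand-in for compile-time elaboration, and it is an assumption of this sketch (not a statement about any Q compiler) that the expansion into four separate calls corresponds to the straight-line code produced by an unroll.

```cpp
#include <cassert>
#include <cstddef>

// Compile-time expansion: Unroll<I, N>::run(f) is elaborated by the
// compiler into the call sequence f(I); f(I+1); ... f(N-1);
template <std::size_t I, std::size_t N>
struct Unroll {
    template <typename F>
    static void run(F&& f) {
        f(I);
        Unroll<I + 1, N>::run(f);
    }
};

// End of the expansion: nothing left to emit.
template <std::size_t N>
struct Unroll<N, N> {
    template <typename F>
    static void run(F&&) {}
};

// Analogue of "unroll (j=0; j<4; j++) { a[j] = j; }": the trip count 4 is
// a compile-time constant.
inline void fillUnrolled4(int* a) {
    assert(a != nullptr);
    Unroll<0, 4>::run([a](std::size_t j) { a[j] = static_cast<int>(j); });
}

// Analogue of a Q loop: the trip count n is only known at run time.
inline void fillLoop(int* a, std::size_t n) {
    assert(a != nullptr);
    for (std::size_t j = 0; j < n; ++j) {
        a[j] = static_cast<int>(j);
    }
}
```

Both functions produce the same array contents; the difference lies in when the number of iterations is fixed, which is the property that allows an unroll, but not a loop, to appear inside a dataflow block.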

In the following example, Q program code of the present invention
computes a FIR filter using stream and state variables, and the unroll command. Each
iteration reads a sample from the input stream, computes, and writes the result to the
output stream. The sample state variable is used to keep a history of the values assigned to
sample.

streamIn<fract16> input;     // Input stream of samples
streamOut<fract16> output;   // Output stream for results

loop (int l=0; l<nOut; l++) dataflow {
    sample = input.read();   // Perform parallel reads
                             // from the input stream
    sum = 0.0;
    unroll (int i=0; i<nCoef; i++) {
        sum = sum + coefReg[i] * sample[nCoef-i];
    }
    output.write(sum);       // Write result to output stream
}
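
For comparison, the arithmetic performed by the Q fragment above can be written as an ordinary sequential C++ function. This is a reference sketch under stated assumptions: the fir() function is hypothetical, double is substituted for the fract16 data type, history values before the first sample are taken to be zero, and the vector named history plays the role of the sample state variable, with history[k-1] standing for sample[k].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for the FIR computation:
//   out = sum over i of coef[i] * sample[nCoef - i]
// where sample[1] is the newest input and sample[nCoef] the oldest kept.
std::vector<double> fir(const std::vector<double>& coef,
                        const std::vector<double>& input) {
    assert(!coef.empty());
    const std::size_t nCoef = coef.size();
    std::vector<double> history(nCoef, 0.0);  // history[k-1] is sample[k]
    std::vector<double> output;
    output.reserve(input.size());

    for (double x : input) {
        // "sample = input.read();" advances time for the state variable.
        for (std::size_t k = nCoef - 1; k > 0; --k) {
            history[k] = history[k - 1];
        }
        history[0] = x;

        // Body of the unroll statement, executed here one term at a time.
        double sum = 0.0;
        for (std::size_t i = 0; i < nCoef; ++i) {
            sum += coef[i] * history[nCoef - i - 1];
        }
        output.push_back(sum);
    }
    return output;
}
```

Note that with this indexing coef[0] multiplies the oldest retained sample and coef[nCoef-1] multiplies the newest, which follows directly from the subscript nCoef-i in the Q code.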



6. ITERATORS 

Data for Q programs is input and output via matrices of the adaptive
computing architecture adapted for memory functionality (or random access memories
(RAMs) that are shared with the host processor). For purposes of the present invention,
the only concern is that values in a memory are transferred to some form of register, and
then transferred back. Data are often stored in the form of arrays that are addressed using
some addressing pattern, for example, linear order for a one-dimensional array or
row-major order for a two-dimensional array. Q "Iterators" are special indexing variables
used to access arrays in a fixed address pattern, and make efficient use of any available
address generators. For example, a two-dimensional array can be accessed in row-major
order using an iterator instead of the usual control structure that uses nested "for" loops.



ram fract16 X[];              // Two dimensional array in RAM
iterator Xi(X, 0, 0, 1, 128,
               1, 0, 1, 64);
sum = sum + Xi;               // Retrieve the next value in the array

In the preferred embodiment, the argument list for an iterator declaration
contains first the array to be accessed, and then groups of four parameters for each
dimension over which the array is to be iterated:

(1) level - referring to the iteration level, in which the 0 level is the
innermost loop and iterates the fastest;

(2) init - referring to the initial value of the index;

(3) inc - referring to the amount added to the index in each iteration; and

(4) limit - referring to the index limit for this index.

It should be noted, however, that as mentioned above, the particular syntax employed
may be highly variable, and many equivalent syntaxes are within the scope of the present
invention.

Each time the iterator is referenced, the next value in the array is accessed
according to the iterator pattern. In the above example, Xi is an iterator used to reference
X as a 128 x 64 two-dimensional array. The address pattern generated is equivalent to
that generated by the following nested "for" loops:

for (j=0; j<64; j=j+1)
    for (i=0; i<128; i=i+1)
        X[i][j]

It should be noted that the inner "for" statement iterates over the first dimension because
level=0 for the first dimension. Although the compiler can often implement array
indexing with an address generator, iterators expose the deterministic address pattern
directly to the compiler for situations that are too complex for the compiler to recognize
on its own. This action reduces the work, i.e., clock cycles, expended to reference an array.
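
The index pattern produced by the four-parameter groups can be sketched in C++ as follows. The Dim structure and the PatternIterator class are hypothetical names introduced for this illustration; the sketch returns index tuples rather than memory addresses, and it assumes that the levels 0 through n-1 each appear exactly once and that the pattern restarts from its initial values after the last element.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// One group of four iterator parameters, as listed above.
struct Dim {
    int level;  // 0 iterates fastest
    int init;   // initial index value
    int inc;    // amount added per iteration
    int limit;  // index limit (exclusive)
};

// Hypothetical software model of an iterator's index pattern.
class PatternIterator {
public:
    explicit PatternIterator(std::vector<Dim> dims) : dims_(std::move(dims)) {
        for (const Dim& d : dims_) idx_.push_back(d.init);
    }

    // Return the current indices (in declaration order), then step the
    // pattern: level 0 advances first and carries into higher levels.
    std::vector<int> next() {
        std::vector<int> current = idx_;
        for (int level = 0; level < static_cast<int>(dims_.size()); ++level) {
            const std::size_t d = findLevel(level);
            idx_[d] += dims_[d].inc;
            if (idx_[d] < dims_[d].limit) break;
            idx_[d] = dims_[d].init;  // wrap this level, carry to the next
        }
        return current;
    }

private:
    // Locate the dimension that was declared with the given level.
    std::size_t findLevel(int level) const {
        for (std::size_t d = 0; d < dims_.size(); ++d) {
            if (dims_[d].level == level) return d;
        }
        assert(false && "each level 0..n-1 must appear exactly once");
        return 0;
    }

    std::vector<Dim> dims_;
    std::vector<int> idx_;
};
```

Declaring the dimensions {0, 0, 1, 128} and {1, 0, 1, 64} reproduces the order of the nested "for" loops shown above, with the first index varying fastest.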



7. LOOP statement 

The Q "loop" statement is defined to have the same syntax utilized in the
C "for" statement. However, Q loops are restricted to execute a fixed number of times,
determined at run time. More precisely, in the statement:

loop (int i=0; i<n; i=i+c) {
    s = s + datal;
}

the iteration variable i and the loop limit n and increment value c cannot be modified in
the loop body. Moreover, in the preferred embodiment, there is no mechanism to break
out of the loop before the predetermined number of iterations have executed. Without a
means to branch from a loop statement, computing overhead, and thus processing time, is
reduced. Other efficient control mechanisms, however, may be implemented in the
adaptive computing architecture.

Figures 6A, 6B and 6C are diagrams providing a useful summary of the Q
programming language of the present invention. Figures 7 through 9 provide exemplary
Q programs. In particular, Figure 7 provides a FIR filter, expressed in the Q language for
implementation in adaptive computing architecture, in accordance with the present
invention; Figure 8 provides a FIR filter with registered coefficients, expressed in the Q
language for implementation in adaptive computing architecture, in accordance with the
present invention; and Figures 9A and 9B provide a FIR filter for a comparatively large
number of coefficients, expressed in the Q language for implementation in adaptive
computing architecture, in accordance with the present invention.

The method and system embodiments of the present invention are readily
apparent. For example, the preferred method for programming an adaptive computing
integrated circuit includes:

(1) using a first program construct to provide for execution of a
computational block in parallel, the first program construct defined as a dataflow
command for informing a compiler that included commands are for concurrent
performance in parallel;

(2) using a second program construct to provide for automatic indexing of
reference to a channel object, the channel object for providing a buffer for storing data,
the second program construct defined as a stream variable for referencing the channel
object;

(3) using a third program construct for maintaining a previous value of a
variable between process invocations, the third program construct defined as a state
variable for maintaining a plurality of previous values of a variable after the variable has
been assigned a plurality of current values (for example, maintaining the "N" most recent
values assigned to the variable);

(4) using a fourth program construct to provide for iterations having a
predetermined number of iterations at a compile time, the fourth program construct
defined as an unroll command for transforming a loop operation into a predetermined
plurality of individual executable operations;

(5) using a fifth program construct to provide array accessing, the fifth
program construct defined as an iterator variable for accessing the array in a
predetermined, fixed address pattern; and

(6) using a sixth program construct to provide for a fixed number of loop
iterations at run time, the sixth program construct defined as a loop command for
informing a compiler that the included commands contain no branching to locations
outside of the loop and that a plurality of loop conditions cannot be changed.

Also for example, the first program construct may be viewed as having a
semantics including a first program construct identifier, such as the "dataflow" identifier;
a commencement designation and a termination designation following the first program
construct identifier, such as "{" and "}", respectively, or another equivalent demarcation;
and a plurality of included program statements contained within the commencement
designation and the termination designation.

The system of the present invention, while not separately illustrated, may
be embodied, for example, in a computer, a workstation, or any other form of computing
device, whether having a processor-based architecture, an ASIC-based architecture, an
FPGA-based architecture, or an adaptively-based architecture. The system may further
include compilers and schedulers, as discussed above.

Numerous advantages of the present invention are readily apparent. The
present invention provides for a comparatively high-level programming language, for
enabling ready programmability of adaptive computing architectures, such as the ACE
architecture. The Q programming language is designed to be backward compatible with
and syntactically similar to widely used and well known languages like C++, for
acceptance within the engineering and computing fields. More importantly, the method,
system, and Q language of the present invention provide new and specialized program
constructs for an adaptive computing environment and for maximizing the performance
of an ACE integrated circuit or other adaptive computing architecture.

The language, system and methodology of the present invention include
program constructs that permit a programmer to define data flow graphs in software, to
provide for operations to be executed in parallel, and to reference variable states in a
straightforward manner. The invention also includes mechanisms for efficiently
referencing array variables, and enables the programmer to succinctly describe the direct
data flow among matrices, nodes, and other configurations of computational elements and
computational units. Each of these new features of the invention provides for effective
programming in a reconfigurable computing environment, facilitating a compiler to
implement the programmed algorithms efficiently in adaptive hardware.

From the foregoing, it will be observed that numerous variations and
modifications may be effected without departing from the spirit and scope of the novel
concept of the invention. It is to be understood that no limitation with respect to the
specific methods and apparatus illustrated herein is intended or should be inferred. It is,
of course, intended to cover by the appended claims all such modifications as fall within
the scope of the claims.
We claim:

1. A method for programming an integrated circuit, the method comprising:

(a) using a first program construct to provide for execution of a
computational block in parallel;

(b) using a second program construct to provide for automatic indexing of
reference to a buffer object;

(c) using a third program construct for maintaining a previous value of a
variable between process invocations; and

(d) using a fourth program construct to provide for iterations having a
predetermined number of iterations at a compile time.

2. The method of claim 1, wherein step (a) further comprises:

using a dataflow command for informing a compiler that included
commands are for concurrent performance in parallel.

3. The method of claim 1, wherein step (b) further comprises:

using a channel object for providing a buffer for storing data; and

using a stream variable for referencing the channel object.

4. The method of claim 3, wherein the channel object is a buffer instantiated
with a declared data type and a size, and wherein the stream variable is declared with a
buffer of a plurality of data items of a specified data type.

5. The method of claim 1, wherein step (c) further comprises:

using a state variable for maintaining a plurality of previous values of a
variable after the variable has been assigned a plurality of current values.

6. The method of claim 1, wherein step (d) further comprises:

using an unroll command for transforming a loop operation into a
predetermined plurality of individual executable operations.
7. The method of claim 1, further comprising:

(e) using a fifth program construct to provide array accessing with a
predetermined address pattern.

8. The method of claim 7, wherein step (e) further comprises:

using an iterator variable for accessing the array in a predetermined, fixed
address pattern.

9. The method of claim 7, wherein the fifth program construct is a
declaration which includes a plurality of arguments, the plurality of arguments including
an iteration level, an initial value of an index, an increment added to the index for a
repeated iteration, and an index limit.

10. The method of claim 1, further comprising:

(f) using a sixth program construct to provide for a fixed number of loop
iterations at run time.

11. The method of claim 10, wherein step (f) further comprises:

using a loop command for informing a compiler that a plurality of included
commands contain no branching to locations outside of the loop and that a plurality of
loop conditions are fixed.

12. The method of claim 1, wherein the first program construct has a
semantics comprising:

a first program construct identifier, followed by a plurality of included
program statements.

13. The method of claim 12, wherein the first program construct has a syntax
comprising:

a dataflow designation;

a commencement designation and a termination designation following the
dataflow designation; and

the plurality of included program statements contained within the
commencement designation and the termination designation.

14. The method of claim 1, wherein the fourth program construct has a
semantics comprising:

a fourth program construct identifier having a plurality of arguments,
followed by program statements for expansion into a plurality of individual commands
according to the plurality of arguments.

15. A system for programming an integrated circuit, the system comprising:

means for using a first program construct to provide for execution of a
computational block in parallel;

means for using a second program construct to provide for automatic
indexing of reference to a buffer object;

means for using a third program construct for maintaining a previous value
of a variable between process invocations; and

means for using a fourth program construct to provide for iterations having
a predetermined number of iterations at a compile time.

16. The system of claim 15, wherein the means for using the first program
construct further comprises:

means for using a dataflow command for informing a compiler that
included commands are for concurrent performance in parallel.

17. The system of claim 15, wherein the means for using the second program
construct further comprises:

means for using a channel object for providing a buffer for storing data;
and

means for using a stream variable for referencing the channel object.
18. The system of claim 17, wherein the channel object is a buffer instantiated
with a declared data type and a size, and wherein the stream variable is declared with a
buffer of a plurality of data items of a specified data type.

19. The system of claim 15, wherein the means for using the third program
construct further comprises:

means for using a state variable for maintaining a plurality of previous
values of a variable after the variable has been assigned a plurality of current values.

20. The system of claim 15, wherein the means for using the fourth program
construct further comprises:

means for using an unroll command for transforming a loop operation into
a predetermined plurality of individual executable operations.

21. The system of claim 15, further comprising:

means for using a fifth program construct to provide array accessing with a
predetermined address pattern.

22. The system of claim 21, wherein the means for using the fifth program
construct further comprises:

means for using an iterator variable for accessing the array in a
predetermined, fixed address pattern.

23. The system of claim 21, wherein the fifth program construct is a
declaration which includes a plurality of arguments, the plurality of arguments including
an iteration level, an initial value of an index, an increment added to the index for a
repeated iteration, and an index limit.
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24. The system of claim 15, further comprising: 

means for using a sixth program construct to provide for a fixed number of 
loop iterations at run time. 

5 25 . The system of claim 24, wherein the means for using the sixth program 

construct further comprises: 

means for using a loop command for informing a compiler that a plurality 
of included commands contain no branching to locations outside of the loop and that a 
plurality of loop conditions are fixed. 

10 

26. The system of claim 1 5, wherein the first program construct has a 
semantics comprising: 

a first program construct identifier, followed by a plurality of included 
program statements. 

27. The system of claim 26, wherein the first program construct has a syntax
comprising:

a dataflow designation;

a commencement designation and a termination designation following the
dataflow designation; and

the plurality of included program statements contained within the 
commencement designation and the termination designation. 



28. The system of claim 15, wherein the fourth program construct has a
semantics comprising:

a fourth program construct identifier having a plurality of arguments,
followed by program statements for expansion into a plurality of individual commands 
according to the plurality of arguments. 

29. A programming language for programming an integrated circuit, the
programming language comprising: 

a first program construct to provide for execution of a computational block
in parallel;

a second program construct to provide for automatic indexing of reference
to a buffer object;

a third program construct for maintaining a previous value of a variable 
between process invocations; and 

a fourth program construct to provide for iterations having a predetermined
number of iterations at a compile time.

30. The programming language of claim 29, wherein the first program
construct further comprises:

a dataflow command for informing a compiler that included commands are
for concurrent performance in parallel.



31. The programming language of claim 29, wherein the second program
construct further comprises:

a channel object for providing a buffer for storing data; and

a stream variable for referencing the channel object.

32. The programming language of claim 31, wherein the channel object is a
buffer instantiated with a declared data type and a size, and wherein the stream variable is 
declared with a buffer of a plurality of data items of a specified data type. 


33. The programming language of claim 29, wherein the third program 
construct further comprises: 

a state variable for maintaining a plurality of previous values of a variable 
after the variable has been assigned a plurality of current values. 



34. The programming language of claim 29, wherein the fourth program
construct further comprises: 

an unroll command for transforming a loop operation into a predetermined 
plurality of individual executable operations. 


35. The programming language of claim 29, further comprising: 

a fifth program construct to provide array accessing with a predetermined 
address pattern. 

36. The programming language of claim 35, wherein the fifth program
construct further comprises:

an iterator variable for accessing the array in a predetermined, fixed 
address pattern. 

37. The programming language of claim 35, wherein the fifth program
construct is a declaration which includes a plurality of arguments, the plurality of
arguments including an iteration level, an initial value of an index, an increment added to 
the index for a repeated iteration, and an index limit. 

38. The programming language of claim 29, further comprising:

a sixth program construct to provide for a fixed number of loop iterations
at run time.

39. The programming language of claim 38, wherein the sixth program
construct further comprises:

a loop command for informing a compiler that a plurality of included
commands contain no branching to locations outside of the loop and that a plurality of 
loop conditions are fixed. 

40. The programming language of claim 29, wherein the first program
construct has a semantics comprising:

a first program construct identifier, followed by a plurality of included
program statements.

41. The programming language of claim 40, wherein the first program
construct has a syntax comprising:

a dataflow designation; 

a commencement designation and a termination designation following the
dataflow designation; and 

the plurality of included program statements contained within the 
commencement designation and the termination designation. 

42. The programming language of claim 29, wherein the fourth program
construct has a semantics comprising:

a fourth program construct identifier having a plurality of arguments, 
followed by program statements for expansion into a plurality of individual commands 
according to the plurality of arguments. 

43. A method for programming an adaptive computing integrated circuit, the
method comprising:

using a first program construct to provide for execution of a computational 
block in parallel, the first program construct defined as a dataflow command for 
informing a compiler that included commands are for concurrent performance in parallel; 

using a second program construct to provide for automatic indexing of 
reference to a channel object, the channel object for providing a buffer for storing data, 
the second program construct defined as a stream variable for referencing the channel 
object; 

using a third program construct for maintaining a previous value of a 
variable between process invocations, the third program construct defined as a state 
variable for maintaining a plurality of previous values of a variable after the variable has 
been assigned a plurality of current values; 

using a fourth program construct to provide for iterations having a
predetermined number of iterations at a compile time, the fourth program construct
defined as an unroll command for transforming a loop operation into a predetermined
plurality of individual executable operations;

using a fifth program construct to provide array accessing, the fifth
program construct defined as an iterator variable for accessing the array in a
predetermined, fixed address pattern; and

using a sixth program construct to provide for a fixed number of loop
iterations at run time, the sixth program construct defined as a loop command for
informing a compiler that a plurality of included commands contain no branching to
locations outside of the loop and that a plurality of loop conditions are fixed.

44. The method of claim 43, wherein the channel object is a buffer instantiated
with a declared data type and a size, and wherein the stream variable is declared with a
buffer of a plurality of data items of a specified data type.

45. The method of claim 43, wherein the fifth program construct is a
declaration which includes a plurality of arguments, the plurality of arguments including
an iteration level, an initial value of an index, an increment added to the index for a
repeated iteration, and an index limit.

46. The method of claim 43, wherein the first program construct has a
semantics comprising:

a first program construct identifier;

a commencement designation and a termination designation following the
first program construct identifier;

and a plurality of included program statements contained within the
commencement designation and the termination designation.

47. The method of claim 43, wherein the fourth program construct has a
semantics comprising:

a fourth program construct identifier having a plurality of arguments,
followed by program statements for expansion into a plurality of individual commands
according to the plurality of arguments.

48. A programming language for programming an adaptive computing
integrated circuit, the programming language comprising:

a first program construct to provide for execution of a computational block
in parallel, the first program construct defined as a dataflow command for informing a
compiler that included commands are for concurrent performance in parallel;

a second program construct to provide for automatic indexing of reference
to a channel object, the channel object for providing a buffer for storing data, the second
program construct defined as a stream variable for referencing the channel object,
wherein the channel object is a buffer instantiated with a declared data type and a size,
and wherein the stream variable is declared with a buffer of a plurality of data items of a
specified data type;

a third program construct for maintaining a previous value of a variable
between process invocations, the third program construct defined as a state variable for
maintaining a plurality of previous values of a variable after the variable has been
assigned a plurality of current values;

a fourth program construct to provide for iterations having a predetermined
number of iterations at a compile time, the fourth program construct defined as an unroll
command for transforming a loop operation into a predetermined plurality of individual
executable operations;

a fifth program construct to provide array accessing, the fifth program
construct defined as an iterator variable for accessing the array in a predetermined, fixed
address pattern; and

a sixth program construct to provide for a fixed number of loop iterations
at run time, the sixth program construct defined as a loop command for informing a
compiler that a plurality of included commands contain no branching to locations outside
of the loop and that a plurality of loop conditions are fixed.

49. The programming language of claim 48, wherein the fifth program
construct is a declaration which includes a plurality of arguments, the plurality of
arguments including an iteration level, an initial value of an index, an increment added to
the index for a repeated iteration, and an index limit.

50. The programming language of claim 48, wherein the first program
construct has a semantics comprising:

a first program construct identifier; a commencement designation and a
termination designation following the first program construct identifier; and a plurality of
included program statements contained within the commencement designation and the
termination designation;

and wherein the fourth program construct has a semantics comprising: a
fourth program construct identifier having a plurality of arguments, followed by program
statements for expansion into a plurality of individual commands according to the
plurality of arguments.

FIG. 3

FIG. 6A

var INDICATES A VARIABLE NAME;

N INDICATES A NUMBER, IS ARCHITECTURE DEPENDENT, AND IS 32 IN A PROTOTYPE DEVELOPMENT;
AND

type INDICATES A VALID TYPE.

DATA DECLARATIONS:

// Declare an integer variable with N bits
integer<N> var;
unsigned integer<N> var;
// int16, int32

// Declare a fractional variable with N bits and F fractional bits
fract<N, F> var;
// fract16, fract32

// RAM variables:
// Variables declared as ram are allocated in RAM, not registers
ram<type> var;

// State variables:
// Declare a state variable with N amount of history
state<type> stvar(N);
// Initialize a state variable explicitly with N amount of history
stvar.init(N);
// Initialize a state variable's value for time t-i
stvar[i] = value;
// Reference a state variable's value at time t-i
var = stvar[i];
// Advance time, and assign a new value to the state variable
stvar = value;
// Declare a state variable array of size Nsize with N history values
state<type> stvar[Nsize](N);
// Reference the nth variable in the state array at time t-i
var = stvar[n][i];
// Debug method: print the current state variable history
stvar.display();
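
The history semantics of a state variable listed above can be emulated in ordinary software. The following Python sketch is illustrative only and is not part of the disclosed language: the class name `StateVar` and the method `assign` are hypothetical, and it assumes that the N values of history include the current value, so that index 0 is time t and index i is time t-i.

```python
from collections import deque


class StateVar:
    """Hypothetical emulation of `state<type> stvar(N)`."""

    def __init__(self, n_history, initial=0):
        # Assumption: position 0 holds the current value (time t),
        # position i holds the value at time t-i.
        self._history = deque([initial] * n_history, maxlen=n_history)

    def assign(self, value):
        # Emulates `stvar = value;`: advance time, then store the new value.
        # The oldest value falls off the far end of the bounded deque.
        self._history.appendleft(value)

    def __getitem__(self, i):
        # Emulates `var = stvar[i];`
        return self._history[i]

    def __setitem__(self, i, value):
        # Emulates `stvar[i] = value;` (initialize the value for time t-i)
        self._history[i] = value


sample = StateVar(3)
for v in (10, 20, 30, 40):
    sample.assign(v)

current, previous, oldest = sample[0], sample[1], sample[2]  # 40, 30, 20
```

A bounded deque is used here only because it discards the oldest value automatically; a hardware implementation would more plausibly be a shift register.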

// Channels:
// Declare a channel for data of the given type with enough room for N items
channel<type> cvar(N);
// Declare a channel for N items, buffer B, and I initial data items
channel<type> cvar(N, B, I);
// Channel methods:
cvar.size()   // Channel buffer size
cvar.items()  // Number of items currently in channel

// Stream variables:
// Declare an input stream variable with a buffer of N items of the given type
// [N] is omitted if no buffer is allocated
streamIn<type> svar(N);
streamOut<type> svar(N);

// Stream use:
// Reference an input stream; returns current value, advances stream
var = svar.read();
// Write to output stream; sends next value, advances stream
svar.write(var);
// Open a stream data item for read/write without advancing stream
var = svar.open();
// Close an open stream data item; advances the stream
svar.close();
// Rewind a stream by N data items. N can be negative
svar.rewind(N);
// Debug method: print the stream buffer, showing current location
svar.display();
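
As a rough software analogy for the channel and stream declarations above, the Python sketch below models a channel as a bounded buffer and a stream as a cursor that indexes the channel automatically on each read or write. The class names and the internal `position` field are hypothetical, and the sketch ignores the `open`/`close` methods and any hardware buffering.

```python
class Channel:
    """Hypothetical stand-in for `channel<type> cvar(N)`: a bounded buffer."""

    def __init__(self, capacity, initial=()):
        self.capacity = capacity
        self.buffer = list(initial)
        if len(self.buffer) > capacity:
            raise ValueError("more initial items than the channel can hold")

    def size(self):
        return self.capacity        # channel buffer size

    def items(self):
        return len(self.buffer)     # number of items currently in the channel


class StreamIn:
    """Hypothetical input stream: each read advances the index automatically."""

    def __init__(self, channel):
        self.channel = channel
        self.position = 0

    def read(self):
        value = self.channel.buffer[self.position]
        self.position += 1
        return value

    def rewind(self, n):
        # Step back n items; a negative n skips forward instead.
        self.position -= n


class StreamOut:
    """Hypothetical output stream: each write appends the next value."""

    def __init__(self, channel):
        self.channel = channel

    def write(self, value):
        if self.channel.items() >= self.channel.size():
            raise OverflowError("channel is full")
        self.channel.buffer.append(value)


channel = Channel(4, initial=[1, 2, 3])
reader = StreamIn(channel)
first = reader.read()     # 1
second = reader.read()    # 2
reader.rewind(1)
again = reader.read()     # 2 again, because the stream was rewound by one item
StreamOut(channel).write(9)
```

The point of the analogy is that the caller never writes an index expression: the stream variable carries the position, which is what the claims describe as automatic indexing of references to the channel object.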

// Iterators:
// Declare an iterator to access an array X
iterator ivar(X, level0, init0, inc0, limit0,
                 level1, init1, inc1, limit1, ...);
// Re/Initialize an iterator
ivar.init(X, level0, init0, inc0, limit0,
             level1, init1, inc1, limit1, ...);
// Initialize the iterator to its initial parameters
ivar.reset();
// Iterator use:
// Reference an array via an iterator
var = ivar.next();
// Assign to an array via an iterator
ivar.next() = var;
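
The iterator declaration above takes one (level, init, inc, limit) group per iteration level. The listing does not spell out how the levels combine, so the Python sketch below makes two labeled assumptions: a higher level number denotes an outer loop, and the array address is the sum of the per-level indices. The class `ArrayIterator` and its `store` method are hypothetical names; `store` stands in for the `ivar.next() = var;` form, which has no direct Python equivalent.

```python
from itertools import product


class ArrayIterator:
    """Hypothetical emulation of `iterator ivar(X, level, init, inc, limit, ...)`."""

    def __init__(self, array, *params):
        if not params or len(params) % 4 != 0:
            raise ValueError("expected groups of (level, init, inc, limit)")
        self.array = array
        groups = [params[k:k + 4] for k in range(0, len(params), 4)]
        # Assumption: a higher level number is an outer (slower) loop.
        groups.sort(key=lambda g: g[0], reverse=True)
        ranges = [range(init, limit, inc) for _, init, inc, limit in groups]
        # Assumption: the address is the sum of the per-level indices.
        # The whole address pattern is fixed, so it can be precomputed.
        self.addresses = [sum(index) for index in product(*ranges)]
        self.position = 0

    def reset(self):
        self.position = 0

    def _next_address(self):
        address = self.addresses[self.position]
        # Wrap around so the pattern repeats without an explicit reset.
        self.position = (self.position + 1) % len(self.addresses)
        return address

    def next(self):
        return self.array[self._next_address()]

    def store(self, value):
        self.array[self._next_address()] = value


# A 2x3 matrix stored row-major, traversed column by column:
# level 0 (inner) steps between rows (stride 3), level 1 (outer) between columns.
matrix = [0, 1, 2, 3, 4, 5]
by_column = ArrayIterator(matrix, 0, 0, 3, 6, 1, 0, 1, 3)
column_order = [by_column.next() for _ in range(6)]   # [0, 3, 1, 4, 2, 5]

# One-dimensional linear access with level 0, init 0, inc 1, limit len(coef).
coef = [0.5, 0.25, 0.125]
linear = ArrayIterator(coef, 0, 0, 1, len(coef))
linear_order = [linear.next() for _ in range(3)]      # [0.5, 0.25, 0.125]
```

Because the address sequence is known before the first access, the body of a loop that uses such an iterator contains no index arithmetic of its own.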

FIG. 6C



CLASSES CAN BE USED TO DEFINE NEW DATA TYPES AS IN C++. THAT IS, type ABOVE
CAN BE A PRIMITIVE OR A USER-DEFINED CLASS.



// Loop a fixed number of times (determined at runtime)
// Loop definition arbitrary as long as repetition number
// can be computed before loop begins
// Loop index cannot be used in loop body
loop (i=0; i<N; i++) {
    statements;
}

// Unroll a set of statements a fixed number of times
// (determined at compile time)
// Loop definition arbitrary as long as compiler can unroll
// Loop index can be used in loop body
unroll (i=0; i<N; i++) {
    statements;
}

// Tell compiler to treat a block of statements as a dataflow graph
dataflow {
    statements;
}
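
To make the distinction concrete, an unroll statement can be thought of as a compile-time text expansion: because the bounds are known, each iteration becomes its own statement with the index substituted as a constant. The short Python sketch below illustrates that expansion only; the helper `unroll` and its `{i}` placeholder convention are hypothetical and say nothing about how an actual compiler for this language is built.

```python
def unroll(start, stop, step, body_template):
    """Expand a loop with known bounds into one statement per iteration.

    `body_template` is a statement containing `{i}` wherever the loop index
    appears; each copy gets the index substituted as a literal constant.
    """
    return [body_template.format(i=i) for i in range(start, stop, step)]


statements = unroll(0, 4, 1, "sum = sum + coef[{i}] * sample[{i}];")
for statement in statements:
    print(statement)
# sum = sum + coef[0] * sample[0];
# sum = sum + coef[1] * sample[1];
# sum = sum + coef[2] * sample[2];
# sum = sum + coef[3] * sample[3];
```

This is also why the loop index may appear in an unroll body but not in a loop body: after expansion the index is a constant in every copy, whereas a loop statement only fixes the repetition count at run time.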

FIG. 7

template <int nCoef>
class fir1 : public hardware {
public:
    int16 nOut; // Number of outputs requested per run()

    // Streams are used to pass coefficients and data
    ram<fract16> coef[];          // Array of coefficients
    streamIn<fract16> input;      // Input array of samples
    streamOut<fract16> output;    // Output array for results
    state<fract16> sample(nCoef); // Input values saved for last nCoef cycles

    // The init method for the fir class is used to initialize input
    // and output streams and load the coefficients
    void init(int16 newNout,
              streamIn<fract16> newCoef,
              streamIn<fract16> newInput,
              streamOut<fract16> newOutput)
    {
        nOut = newNout; // Number of outputs that run() produces

        // Initialize streams from parameters
        coef = newCoef;
        input = newInput;
        output = newOutput;

        // Initialize the input history in the sample state variable
        unroll (int i=0; i<nCoef-1; i++) dataflow {
            sample = input.read();
        }
    }

    // The 'run' method takes the next block of input samples and outputs the
    // filtered results.
    void run(void)
    {
        fract16 sum; // Accumulator for output values

        // On each pass, produce one output
        // This computation is one dataflow graph
        loop (int l=0; l<nOut; l++) dataflow {
            sample = input.read(); // Get next sample from input stream

            // Perform single convolution
            // sample[i] refers to the value of sample at time (t-i)
            sum = 0;
            unroll (int i=0; i<nCoef; i++) {
                sum = sum + coef[i] * sample[nCoef-i];
            }
            output.write(sum); // Put result to output stream
        }
    }
};
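
For readers who want to check the arithmetic of the filter in this listing, the following Python function is a plain software reference model, offered as a sketch rather than a translation. It assumes that the window of nCoef samples includes the sample just read and that `coef[0]` multiplies the oldest sample in the window, so it indexes the history as `nCoef - 1 - i`; the listing itself writes `sample[nCoef-i]`, and its exact history convention is not restated here. The function name `fir_filter` is hypothetical.

```python
from collections import deque


def fir_filter(coef, samples):
    """Reference model of a direct-form FIR filter with preloaded history."""
    n_coef = len(coef)
    stream = iter(samples)

    # Counterpart of init(): preload nCoef-1 samples, newest at index 0.
    history = deque(maxlen=n_coef)
    for _ in range(n_coef - 1):
        history.appendleft(next(stream))

    outputs = []
    # Counterpart of run(): one output per remaining input sample.
    for x in stream:
        history.appendleft(x)            # like `sample = input.read();`
        total = 0
        for i in range(n_coef):          # fully unrolled in the listing
            # Assumption: coef[0] pairs with the oldest sample in the window.
            total += coef[i] * history[n_coef - 1 - i]
        outputs.append(total)            # like `output.write(sum);`
    return outputs


result = fir_filter([1, 2, 3], [1, 2, 3, 4, 5])   # [14, 20, 26]
```

With three coefficients and five input samples, the first two samples only fill the history, so three outputs are produced. In the listing the inner loop is an unroll and the outer loop body is a single dataflow graph, so the multiplications for one output can proceed in parallel; the sequential Python loop reproduces only the numerical result.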

FIG. 8

template <int nCoef>
class fir1 : public hardware {
public:
    int16 nOut; // Number of outputs requested per run()

    streamIn<fract16> coef;    // Stream of coefficients
    streamIn<fract16> input;   // Input stream of samples
    streamOut<fract16> output; // Output stream for results

    fract16 coefReg[nCoef];       // Copy of coefficients in registers
    state<fract16> sample(nCoef); // Input values saved for last nCoef cycles
    state<fract16> sample;        // Compiler complains, so we initialize below

    void init(int16 newNout,
              streamIn<fract16> newCoef,
              streamIn<fract16> newInput,
              streamOut<fract16> newOutput)
    {
        // ... variable here since we can't above

        nOut = newNout; // Number of outputs that run() produces

        // Initialize streams from parameters
        coef = newCoef;
        input = newInput;
        output = newOutput;

        // Copy the coefficients into the coefficient registers
        // These will be saved from one invocation of run to the next
        // Initialize the input history in the sample state variable
        // We do this in one loop so that stream reads can be done
        unroll (int i=0; i<nCoef; i++) dataflow {
            sample = input.read();
        }
    }

    // The 'run' method takes the next block of input samples and outputs the
    // filtered results.
    void run(void)
    {
        fract16 sum; // Accumulator for output values

        // On each pass, produce one output
        // This computation is one dataflow graph
        loop (int l=0; l<nOut; l++) dataflow {
            sample = input.read(); // Read the next sample from input stream

            // Perform single convolution
            // sample[i] refers to the value of sample at time (t-i)
            unroll (int i=0; i<nCoef; i++) {
                sum = sum + coefReg[i] * sample[nCoef-i];
            }
            output.write(sum); // Write result to output stream
        }
    }
};

class fir2 : public hardware {

    int16 nOut;    // Number of outputs requested
    int16 nPasses; // Number of passes required for nOut outputs
    int16 nCoef;   // Size of coef array (runtime value)
    fract16 *coef; // Array of coefficients
    iterator<fract16> coefI;

    // Input stream: We read it as a simple stream, but we have
    // to rewind the stream between iterations because we read ahead
    streamIn<fract16> inputStr;

    // Outputs are written in simple linear order
    streamOut<fract16> outputStr; // Output stream for results

public:
    // The init method for the fir class is used to initialize the
    // streams from the input parameters
    void init(int16 newNout,
              fract16 newCoef[], int16 newNcoef,
              streamIn<fract16> newInput,
              streamOut<fract16> newOutput)
    {
        // Establish new execution parameters
        nOut = newNout;
        nCoef = newNcoef;

        // Initialize coefficient array
        coef = newCoef;

        // Use 1D linear access for accessing coefficients and input data
        coefI.init(coef, 0, 0, 1, nCoef);

        // Initialize streams from stream parameters
        inputStr = newInput;
        outputStr = newOutput;

        // NPROC outputs are computed each pass;
        // determine the number of passes required to produce all outputs
        nPasses = nOut/NPROC;
    }

    // The 'run' method takes the next block of input samples and outputs the
    // filtered results
    void run(void)
    {
        state<fract16> sample(NPROC); // Input samples saved for NPROC cycles
        fract16 sum[NPROC];           // Accumulators for output values
        fract16 curCoef;              // Current coefficient value
        int i;                        // Compile-time variable

        // Initialize the array starting indices
        // On each pass, produce NPROC outputs
        loop (int l=0; l<nPasses; l++) {
            coefI.reset(); // Reset iterator to initial state (not really

            // Initialize state with inputs, and initialize sums
            unroll (i=0; i<NPROC; i++) dataflow {
                sum[i] = 0.0;
                if (i<(NPROC-1)) sample = inputStr.read();
            }

            loop (int l=0; l<nCoef; l++) dataflow {
                curCoef = coefI.next();   // Get next coefficient
                sample = inputStr.read(); // Get next input sample
                                          // (current input is sample[0])
                // Multiply the samples by the coefficient
                // sum[0] corresponds to oldest sample
                unroll (i=0; i<NPROC; i++) {
                    sum[i] = sum[i] + curCoef * sample[NPROC-i];
                }
            }

            // We have to back up the input stream because we need to re-read
            // of the data we just read
            // We had to read ahead by the number of coefficients - 1
            inputStr.rewind(nCoef - 1);

            // Write filtered data back to RAM
            // We could overlap the initialization for the next set of
            // output values with this writeback
            unroll (i=0; i<NPROC; i++) dataflow {
                outputStr.write(sum[i]);